Add Neon implementation of `includes` by hazzlim · Pull Request #6076 · microsoft/STL · GitHub

Add Neon implementation of `includes` #6076

Merged
StephanTLavavej merged 2 commits into microsoft:main from hazzlim:includes-neon-pr on Feb 11, 2026

Conversation

@hazzlim (Contributor) commented Feb 5, 2026

This PR adds a Neon path for the semi-vectorized implementation of `includes`.

Similar to #5590, the results are a mixed bag. Most significantly for this PR, however, the new Neon path is slower than Clang's scalar codegen for 64-bit types. I am not sure whether this was also the case for the x86 vectorization, as that PR appears to report only MSVC numbers (?)

I guess we could avoid 64-bit vectorization; the tradeoff would be improving Clang perf at the cost of MSVC perf.
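
For readers following the comparison, the scalar baseline that a vectorized path competes with is the usual two-pointer scan over two sorted ranges. The sketch below is only an illustration of that standard algorithm, not the code in `stl/src/vector_algorithms.cpp`; the name `includes_scalar` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not the STL's code): returns true when every
// element of the sorted range `needle` also appears in the sorted range `hay`,
// with the same multiset semantics as std::includes.
template <class T>
bool includes_scalar(const std::vector<T>& hay, const std::vector<T>& needle) {
    std::size_t i = 0; // current position in hay
    std::size_t j = 0; // current position in needle
    while (j < needle.size()) {
        if (i == hay.size() || needle[j] < hay[i]) {
            return false; // hay is exhausted, or needle[j] can no longer be found
        }
        if (!(hay[i] < needle[j])) {
            ++j; // equivalent elements: this needle element is matched
        }
        ++i; // hay advances on every step
    }
    return true;
}
```

Each step does one or two comparisons and a data-dependent branch, which is presumably why a compiler with good scalar codegen can be hard to beat for wide element types.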

@hazzlim hazzlim requested a review from a team as a code owner February 5, 2026 17:02
@github-project-automation github-project-automation Bot moved this to Initial Review in STL Code Reviews Feb 5, 2026
@hazzlim (Contributor, Author) commented Feb 5, 2026

@AlexGuteniev (Contributor) commented

We didn't target Clang performance much in the past, and we didn't build and run the benchmarks with Clang until #5533. I intended #5533 as preparation for #5591 (the 6-byte color case in `mismatch`, which can be vectorized only for Clang because MSVC lacks the ability to detect such a case). Although #5590 came later than that, Clang numbers weren't obtained for it.

Sure, the Clang numbers can be collected now. We can build with `set CXXFLAGS=/D_USE_STD_VECTOR_ALGORITHMS=0` to revert to the pre-vectorized state. This is a bit time-consuming for this particular benchmark, as there are a lot of cases (for a reason: it is an attempt to represent the variety of subset/"needle" distributions). But I can do that if needed.
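
As a rough illustration of what such a comparison involves, the sketch below builds a sorted 64-bit haystack and a "needle" made of every n-th element, and reports whether the `_USE_STD_VECTOR_ALGORITHMS` macro is in effect. The helper names (`vector_algorithms_enabled`, `make_sorted`, `every_nth`) are hypothetical, and the stride-based needle is just one assumed distribution; the cases used by the actual benchmark are not shown in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true only when the standard library defines
// _USE_STD_VECTOR_ALGORITHMS to a nonzero value. On libraries that do not
// define the macro at all, this returns false.
constexpr bool vector_algorithms_enabled() {
#if defined(_USE_STD_VECTOR_ALGORITHMS) && _USE_STD_VECTOR_ALGORITHMS
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: sorted haystack 0, step, 2*step, ...
std::vector<std::uint64_t> make_sorted(std::size_t count, std::uint64_t step) {
    std::vector<std::uint64_t> v(count);
    for (std::size_t i = 0; i < count; ++i) {
        v[i] = static_cast<std::uint64_t>(i) * step;
    }
    return v;
}

// Hypothetical helper: needle made of every `stride`-th haystack element
// (an assumed distribution, chosen only for illustration).
std::vector<std::uint64_t> every_nth(const std::vector<std::uint64_t>& hay, std::size_t stride) {
    std::vector<std::uint64_t> out;
    if (stride == 0) {
        return out; // avoid an infinite loop on a zero stride
    }
    for (std::size_t i = 0; i < hay.size(); i += stride) {
        out.push_back(hay[i]);
    }
    return out;
}
```

Timing `std::includes` over such inputs in two builds, one with the macro set to 0 and one without, would give the vectorized versus pre-vectorized comparison discussed here.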

@StephanTLavavej StephanTLavavej self-assigned this Feb 5, 2026
@StephanTLavavej StephanTLavavej added the performance (Must go faster) and ARM64 (Related to the ARM64 architecture) labels Feb 5, 2026
Comment thread stl/src/vector_algorithms.cpp Outdated
@StephanTLavavej StephanTLavavej removed their assignment Feb 6, 2026
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Feb 6, 2026
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Feb 9, 2026
@StephanTLavavej (Member) commented

I'm mirroring this to the MSVC-internal repo. Please notify me if any further changes are pushed, otherwise no action is required.

@StephanTLavavej StephanTLavavej merged commit 4e9e747 into microsoft:main Feb 11, 2026
45 checks passed
@github-project-automation github-project-automation Bot moved this from Merging to Done in STL Code Reviews Feb 11, 2026

Labels

ARM64 (Related to the ARM64 architecture), performance (Must go faster)


3 participants