find_symbols.h: implement AArch64/NEON SIMD path#104228
Conversation
Implements NEON code paths in the SIMD search loops in `find_symbols.h` (used by `find_first_symbols`, `find_first_not_symbols`, `find_last_symbols_or_null`, `find_last_not_symbols_or_null`, `splitInto`, etc.). On AArch64 the dispatch previously fell through to a scalar byte-by-byte loop. The NEON helpers (`neon_is_in`) mirror the SSE2 ones (`mm_is_in`): each search symbol is broadcast with `vdupq_n_u8` and OR-folded with `vorrq_u8` over per-byte `vceqq_u8` results. There is no direct equivalent of `pmovmskb` on AArch64, so the resulting 16-byte all-0/all-1 vector is compressed with the standard `vshrn_n_u16(..., 4)` trick into a 64-bit value where each input byte occupies a 4-bit nibble. The first / last matching position is then recovered as `__builtin_ctzll(mask) >> 2` / `__builtin_clzll(mask) >> 2`. Negative search inverts the equality vector with `vmvnq_u8` before compression. Tested with a standalone harness (mirrors `gtest_find_symbols` plus 5000 random fuzz iterations against a scalar reference, plus long-haystack and null-byte cases). On a 64MB buffer the NEON path is ~9-13x faster than the scalar fallback (e.g. single character: 2.78 -> 26.2 GB/s; 8 runtime symbols: 0.62 -> 7.93 GB/s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [06afc7d] Summary: ✅ AI ReviewSummaryThis PR adds an AArch64/NEON vectorized path for Final Verdict
|
Algunenano
left a comment
There was a problem hiding this comment.
Looks good. We could We overload maybe_negate to handle uint8x16_t and keep the same code structure but that's minor.
|
@groeneai, fix |
|
@alexey-milovidov — STID 2508-2e40 ( We posted a comprehensive analysis there earlier today (2026-05-06): #104183 (comment) TL;DR:
Opening a separate PR purely for the phi_squared manifestation would splinter what is one underlying bug. Will file the fix under #104183 once the offending combinator is pinpointed. Happy to upstream debug-build invariants in — @groeneai (worker session |
Refactors the NEON paths in find_symbols.h to use a uint8x16_t overload of maybe_negate, mirroring the structure of the SSE2 paths. No behavior change (the previous explicit `if constexpr (!positive) eq = vmvnq_u8(eq)` is the same operation). #104228 (review) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…k overhead The previous NEON `find_first_symbols_sse2` runtime-needle path padded unused needle slots in `neon_is_in_prepare` with `vdupq_n_u8(0)` and unconditionally iterated all 16 needles in `neon_is_in_execute`. Two issues followed: 1. Correctness: a haystack containing `\0` would falsely match in the SIMD body whenever fewer than 16 needles were given, because the padding zeros compared equal to NUL bytes. The bug only manifests for haystacks at least 16 bytes long, so the existing `FindNotSymbols.NullCharacter` test (9-byte haystack) did not catch it. 2. Performance: the prepare phase ran 16 `vdupq_n_u8` broadcasts even when the haystack was shorter than 16 bytes (so the SIMD body never executed), and the execute phase did 16 `vceqq_u8` + 16 `vorrq_u8` per block even for tiny needle sets. This caused regressions on `trim_numbers` with short decimal strings against a 6-character needle, surfaced by perf CI on PR #104228. The fix: - `neon_is_in_prepare` only fills `num_chars` slots. - `neon_is_in_execute` takes `num_chars` and only iterates over real needles. - The prepare call is moved inside an `if (pos + 15 < end)` guard so it is skipped entirely for haystacks too short for the SIMD body. The extended `FindNotSymbols.NullCharacter` test exercises the SIMD body with a NUL byte at offset 16 to lock in the correctness fix.
For the runtime-needle variants of `find_first_symbols_sse2` and `find_last_symbols_sse2`, wrap the SSE2 main loop (and the `mm_is_in_prepare` call before it) in a guard so we don't pay the needle-preparation cost when the haystack is shorter than 16 bytes and the SIMD body would not execute. Mirrors the same guard already added to the AArch64/NEON path in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before this fix, the AArch64 NEON path was hot for sparse matches (URL parsers, etc.) but slowed down dense-match callers (CSV/TSV/JSON format readers, `trim` with custom characters): for each call the SIMD body did a full vector load + 3-4 `vceqq` + `vorrq` + `shrn` + `ctz`, which costs more than the handful of byte compares scalar would do when the match is within the first few bytes. CI perf comparison on PR #104228 surfaced this as +27% to +74% regressions on `count_from_formats`, `tsv_csv_nullable_parsing`, `formats_columns_sampling`, `json_input_format_part_fields`, and `trim_numbers` on `Performance Comparison (arm_release, master_head, 5/6)`. The fix is a short scalar pre-check (8 bytes) added before the SIMD body in all four AArch64 paths (`find_first_symbols_sse2` and `find_last_symbols_sse2`, both compile-time-template and runtime-needle variants). Matches that fall in those 8 bytes return without ever entering the SIMD loop, and the SIMD body still dominates for sparse matches further into the haystack. Microbenchmarks (64MB buffer, 10 iterations, AArch64 c8g): | Pattern | Scalar | NEON (before fix) | NEON (with fix) | |---|---|---|---| | CSV-like dense (1-7 byte gaps, 3 needles) | 1350 MB/s | 396 MB/s | 2130 MB/s | | TSV-like dense (1-7 byte gaps, 2 needles) | 1430 MB/s | 568 MB/s | 2660 MB/s | | URL-like sparse (~50 byte gaps, 2 needles) | 5950 MB/s | 7470 MB/s | 7000 MB/s | The 5000-iteration random fuzz harness from the PR description plus the gtest cases (including `FindNotSymbols.NullCharacter` against the 17-byte haystack that exercises the SIMD body) all pass. Perf failure on #104228
`trim_numbers` got 10-22% slower on AArch64 in the CI perf comparison for `PR=104228` because clang inlined the new NEON body into the per-row callers, adding extra branches in the short-haystack fast path and (for the runtime needle variant) reserving a 320-byte stack frame for `std::array<uint8x16_t, 16> needles` even when the SIMD body was skipped. Move the NEON work into `[[gnu::noinline]]` `find_first_symbols_neon_long` / `find_last_symbols_neon_long` (compile-time needles) and `find_first_symbols_neon_long_rt` / `find_last_symbols_neon_long_rt` (runtime needles), then tail-call them from the inline dispatcher only when `end - begin >= 16`. The inline body collapses back to the original scalar loop the compiler can auto-vectorise/unroll, and the runtime-needle path no longer reserves stack space for `std::array<uint8x16_t, 16>` per call. Also drop `neon_is_in_prepare` / `neon_is_in_execute` in favour of the already-existing `neon_is_in(bytes, symbols, num_chars)` overload: on AArch64 with clang the by-value `std::array` return forced the spilling. Local measurements on the same AArch64 instance, `numbers(10000000)` for each `trim_numbers` query (`master` is the previously installed ClickHouse binary, `PR-orig` is `b045a07c396` of this branch, `V7` is this commit): | query | master | PR-orig | V7 | | ------------------------------------------ | ------ | ------- | ---- | | `trim(toString(number))` | 103ms | 110ms | 104ms | | `ltrim(toString(number))` | 83ms | 101ms | 98ms | | `rtrim(toString(number))` | 86ms | 104ms | 99ms | | `trim(LEADING '012345' FROM ...)` | 139ms | 178ms | 144ms | | `trim(TRAILING '012345' FROM ...)` | 168ms | 188ms | 174ms | | `trim(BOTH '012345' FROM ...)` | 246ms | 275ms | 229ms | CI report: #104228 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds focused regression tests for `find_last_symbols_or_null(string_view,
SearchSymbols)` and `find_last_not_symbols_or_null(string_view,
SearchSymbols)` on haystacks long enough to enter the AArch64 NEON
reverse-search body (`end - begin >= 16`).
Covers the reverse index mapping (`__builtin_clzll(mask) >> 2`) for the
runtime-needle path:
- 32-byte haystack: matches in the most-recent SIMD chunk, in the
earlier chunk only, at the chunk boundary, multiple matches, and
a no-match case.
- 17-byte haystack: hits the 1-byte scalar tail of the reverse helper,
the first byte of the haystack, and an interior SIMD-body match.
- `\0` byte in the haystack: positive find_last must locate it when
`\0` is in the needle, and must not falsely match it otherwise.
- Negative variant: prologue boundary, SIMD-body interior match,
chunk-boundary case, all-needle (must return nullptr), and `\0` as
a non-needle byte that must be found.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`find_first_symbols_sse2` and `find_last_symbols_sse2` (runtime variants) call `mm_is_in_prepare`, which value-initialises a `std::array<__m128i, 16>` and only fills the first `num_chars` slots. The remaining slots stay at `_mm_setzero_si128()`, i.e. as if they contained the byte `\0`. The subsequent `mm_is_in_execute` ORs all 16 comparisons, so any `\0` byte in the haystack spuriously matches via an unused slot even when `\0` is not in the needle. The bug is observable for `num_chars < 5`; for larger needles the dispatcher routes to the SSE 4.2 path instead. Pad the unused slots with a repeat of `result[0]` so the comparison only ever matches bytes that actually appear in the needle. The setup cost is amortised across the SIMD loop and the per-chunk hot path is unchanged. Exposed by the new regression tests added in this PR for `find_last_symbols_or_null(string_view, SearchSymbols)` and `find_last_not_symbols_or_null(string_view, SearchSymbols)` against haystacks containing an embedded `\0`. The corresponding NEON paths are unaffected because they call `neon_is_in(bytes, symbols, num_chars)` directly and iterate exactly `num_chars` times. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104228&sha=726ffb1642471e8ebfe17e1e802ddc1e09490cd2&name_0=PR&name_1=Unit%20tests%20%28asan_ubsan%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@groeneai, investigate the chronic |
|
@alexey-milovidov investigation results — gdb.log analysis of Root cause: leaked local
|
| Mutex addr | Stuck workers |
|---|---|
| 0xff34114a6b00 | 16 |
| 0xff34114a6800 | 16 |
| 0xff34114a6c80 | 8 |
| 0xff34114a6980 | 8 |
| 0xff355d312e00 | 3 ← main thread is waiting on a worker of this pool |
| 0xff355d312c80 | 3 |
| 0xff355d312f80 | 2 |
All threads are named "ThreadPool" (the default — setThreadName runs inside the worker loop AFTER receiving a job, but these never received one). So the leaked pools can't be identified by thread name alone.
Why this matches the chronic 8-10/day rate
Local ThreadPool (= ThreadPoolImpl<ThreadFromGlobalPool>) instances must be destroyed (which calls finalize() → sets shutdown → notifies) before GlobalThreadPool::shutdown() runs. Some subsystem(s) hold a reference to a local ThreadPool past the scope guard at Server.cpp:1482. Likely candidates: long-lived singletons, registries, or thread-local resources that aren't drained in Server::main cleanup. Different stress runs hit different leakers, which is consistent with this being an umbrella affecting multiple subsystems (8-10/day spread across several specific patterns).
The absl::Mutex frames you saw
gRPC has its own WorkStealingThreadPool. 17 threads in this gdb.log are stuck in gRPC's WorkSignal::WaitWithTimeout (which uses absl::Mutex). They're not the blocker — they're just idle workers of gRPC's pool that haven't been told to exit because gRPC's EventEngine global is presumably one of the things that doesn't clean up before the Server::main scope guard. They're a symptom of the same broader pattern (subsystem lifetimes not properly ordered against GlobalThreadPool shutdown), not an independent deadlock.
Direction for fix
Two angles, complementary:
-
Targeted (per-subsystem): Identify the leaked local pools. The most practical way is to add a
LOG_INFOlistingname, num_threads, scheduled_jobsfor everyThreadPoolImplstill alive immediately before theGlobalThreadPool::shutdown()call (or inGlobalThreadPool::finalizeitself, before notify_all). When chronic hits land in CI, the log will name the leaker(s). Possibly enable behind a debug flag to avoid prod overhead. Then fix the lifetime in each named owner. -
Structural: A global registry of live
ThreadPoolImpl<ThreadFromGlobalPool>instances;GlobalThreadPool::shutdown()walks the registry, setsshutdown=trueand notifies each registered pool's condvar before joining its ownstd::threads. This breaks the dependency on subsystem destruction order. Slight added cost per pool construction/destruction (register/unregister + lock); modest. Downside: papers over real lifetime bugs.
Recommendation: do (1) first to identify the offenders, then (2) as a backstop if the list of leakers is long-tailed.
Not pushing a fix yet
Per our policy, no fix PR without local reproduction first. Will:
- Build a debug binary with the diagnostic LOG_INFO patch from (1) and run stress locally to confirm at least one specific leaker
- Open a fix PR for whichever subsystem(s) show up
- If the list is long, also open (2) as a backstop
Tracking under task 2026-04-01-ci-hung-check-deadlock-chronic-trunk-is. CIDB rate (14d): 8-10/day on master with a 78-hit spike on May 11 (concentrated in 2 PRs, possibly a single bad run). Will keep this thread updated.
|
@groeneai, investigate the failure: |
|
@groeneai, investigate the |
Address review feedback on PR #104228: `mm_is_in_prepare` documents the assumption that `num_chars >= 1`, but `SearchSymbols` still allows an empty needle. With `num_chars == 0` on the `__SSE2__` runtime path, the prepared vector remains effectively zero-filled, so `mm_is_in_execute` would match embedded `\0` bytes in SIMD chunks. Add an explicit `symbols.str.empty()` fast path to both `find_first_symbols_dispatch` and `find_last_symbols_dispatch` (the runtime-needle entry points) so the SIMD body is never invoked with an empty needle. The fast path matches the scalar semantics: for positive search nothing is found (returns end/nullptr); for the negative variant the first (or last, in the find-last case) byte qualifies. Cover the case with a new `FindSymbols.EmptyRunTimeNeedle` gtest that exercises short, empty, and long (>= 16 bytes, with embedded `\0`) haystacks across all six runtime-needle entry points.
Fixes the build failure on `Build (amd_debug)`, `Build (amd_asan_ubsan)`, `Build (amd_tsan)`, `Build (amd_msan)`, and `Build (llvm_coverage_build)` introduced by 65e6fb5: error: default initialization of an object of const type 'const SearchSymbols' without a user-provided default constructor `SearchSymbols` carries a `__m128i simd_vector` member under `__SSE4_2__` and an explicitly-defaulted `SearchSymbols() = default;` constructor, so default-initializing a `const` instance leaves `simd_vector` uninitialized and the compiler rejects the form. Switch the test to value-initialization (`SearchSymbols empty{}`), which zero-initializes the trivial members and is well-formed for `const`. The runtime semantics are unchanged: the new `symbols.str.empty()` fast path in `find_first_symbols_dispatch` / `find_last_symbols_dispatch` already bypasses the SIMD body, so the value of `simd_vector` is never observed for the empty-needle case the test exercises. CI report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104228/65e6fb580a6f05bb6b3822bb8a0902e6f3cccd91/build_amd_debug/build_clickhouse/build_clickhouse.log Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: 100.00% (109/109) · Uncovered code |
What we can see on the public side
SELECT countIf(pull_request_number=0) AS master, countIf(pull_request_number>0) AS prs
FROM default.checks
WHERE test_name LIKE '%test_statistics_cache%'
AND test_status IN ('FAIL','ERROR')
AND check_start_time > now() - INTERVAL 60 DAY
-- master=0, prs=1048, distinct_prs=15All 1048 PR failures are concentrated on 2 PRs, both unmerged: Notably your own PR #100185 makes every What we can't seeThe failing report URL is The shared code pathThe
Recent public master commits touching this path that landed after the last green state and could be the sync regressor:
What we need from youWithout the private CI artifact we're guessing. To root-cause this we need any of:
If (1) shows the failure is on the same |
PR ClickHouse#104228 added a NEON SIMD path to find_symbols.h. On AArch64 the inline dispatcher tail-calls an out-of-line ([[gnu::noinline]]) helper whenever end - begin >= 16; that helper ran an 8-byte scalar prefix and then the NEON loop. That guard is the wrong signal for format readers (CSV/TSV/JSON). They pass a haystack that ends at the far read-buffer end, but whose delimiter/quote is within the first few bytes of the current field. end - begin is large, so every short field pays a branch to out-of-line code where master used a tiny inline scalar loop. This is the regression reported in ClickHouse#106605 for the tsv_csv_nullable_parsing query 1 perf test on ARM (CSV Nullable(String), 1M short rows): about +18..30% on Neoverse-V2. Move the 8-byte scalar prefix out of the out-of-line helper and into the inline dispatcher, then tail-call the helper (now the NEON loop only) at begin + 8 (forward) / end - 8 (backward). Short fields whose delimiter is in the first 8 bytes now resolve fully inline with no SIMD call and no repeated work, while sparse long-haystack scans reach the NEON loop unchanged. The [[gnu::noinline]] attribute now guards only the SIMD loop, which is its intended purpose: keeping the vector loop out of short-string callers such as trim. Applied to all four AArch64 dispatch sites (find_first / find_last, compile-time and runtime needles). x86 paths are unchanged. Validated by cross-compiling for aarch64 (clang-21) and running the NEON code under qemu-aarch64: 2.4M randomized/exhaustive correctness checks against a scalar reference with 0 mismatches, and the existing gtest_find_symbols suite passes. Microbenchmarks behind a real call site show the short-field CSV case beats master at every delimiter offset 0..16 while long-haystack throughput is unchanged (1.13 GB/s, identical to master). Closes: ClickHouse#106605 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The trim_numbers performance test query 1 (ltrim(toString(number))) regressed ~12-14% on ARM/aarch64 after the NEON rewrite of find_first_not_symbols (PR ClickHouse#104228). Decimal strings have no leading whitespace, so default trimLeft is a no-op that still calls the per-row symbol scan and copies the whole column into a fresh ColumnString. Detect this case cheaply: default left-trim removes only leading spaces, so a row is unchanged iff its first byte is not a space. When no row would change, return the input column directly and skip the per-row scan and the full copy. This sidesteps the regressed code path entirely for the common case and avoids materializing an identical column. Other paths (trimRight, trimBoth, custom trim characters, FixedString) are unchanged. The mixed-column case (some row starts with a space) falls back to the normal path, so leading spaces are still removed. Closes: ClickHouse#106601 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The trim_numbers performance test query 1 (ltrim(toString(number))) regressed ~12-14% on ARM/aarch64 after the NEON rewrite of find_first_not_symbols (PR ClickHouse#104228). Decimal strings have no leading whitespace, so default trimLeft still paid the per-row SIMD scan setup for a no-op. Restructure FunctionTrim into three separate non-template paths (vectorSpace, vectorCustom, vectorFixed) instead of the template-with-if-constexpr machinery plus executeImpl. The default (space) path now advances begin/end with a scalar while loop comparing single bytes, which beats the SIMD scan for short strings and exits after one comparison when there is no leading space. Output capacity is reserved once and the size is set after the loop, removing the per-row resize. trim_left/trim_right are cheap runtime ifs in the loop body. Custom trim character sets keep O(1) table lookups; FixedString is handled by a single non-template function. Results are unchanged for all inputs (verified against the regex baseline and under ASan/UBSan). Closes: ClickHouse#106601 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>





On AArch64 the SIMD search loops in
base/base/find_symbols.hpreviously fell through to the scalar byte-by-byte fallback (only__SSE2__and__SSE4_2__were handled). This patch adds NEON code paths to all four loops (find_first_symbols_sse2andfind_last_symbols_sse2, both compile-time-template and runtime-needle variants), so functions built on top —find_first_symbols,find_first_not_symbols,find_last_symbols_or_null,find_last_not_symbols_or_null,splitInto, etc. — get vectorised on ARM.The NEON helpers (
neon_is_in) mirror the SSE2 ones (mm_is_in): each needle is broadcast withvdupq_n_u8, compared withvceqq_u8, and OR-folded withvorrq_u8. There is no direct equivalent ofpmovmskbon AArch64, so the 16-byte all-0/all-1 result is compressed with the standardvshrn_n_u16(..., 4)trick into a 64-bit value where each input byte occupies a 4-bit nibble; the first/last matching byte is then recovered as__builtin_ctzll(mask) >> 2/__builtin_clzll(mask) >> 2. Negative search inverts the equality vector withvmvnq_u8before compression. The same trick is already used incontrib/simdjson,contrib/vectorscan,contrib/SimSIMD, andcontrib/StringZilla.__SSE4_2__is x86-only, so on AArch64 the dispatch falls through to the_sse2-named variants for any number of symbols 1–16, and the_mm_cmpestr*path is not duplicated.Verified with a standalone harness that mirrors
gtest_find_symbols(including theNullCharactercase) and runs 5000 random fuzz iterations comparing against a scalar reference, plus long-haystack and edge-position cases for both compile-time and runtime needles.Microbenchmark (function-level, 64MB buffer, lowercase ASCII)
find_first_symbols<'\t'>find_first_symbols<4 chars>find_first_symbols<8 chars>find_first_symbols(runtime, 4 chars)find_first_symbols(runtime, 8 chars)Real-query performance — ClickBench
hitsshard (1M rows of real web data: URL, Title, SearchPhrase)5 iterations, average wall time, AWS Graviton (
aarch64). Both binaries built from the same26.5.1.1master source; the only difference is whetherfind_symbols.hhas the NEON path.TSVcount()extractURLParameter(URL, 'q')splitByChar('/', URL)TSVoutput (write 1M rows)CSVcount()JSONEachRowoutput (write 1M rows)JSONEachRowcount()Real-query performance — synthetic 10M-row workloads (1-1.5GB on disk)
Same setup; chosen to exercise specific call sites with controlled density.
TSVcount()TSVsum(numeric_col)LineAsStringcount()splitByChar(',', s)over 10M strspath(url)over 10M URLsqueryString(url)over 10M URLsextractURLParameter(url, 'q')JSONEachRow(sparse strings)CSVcount()(synthetic, dense delims)JSONEachRowcount()(synthetic, dense)Note on the dense-
JSONEachRowregressionOn the synthetic dataset above the average distance between matches is ~6-8 bytes (every quoted field ends inside a single 16-byte vector chunk). Sampling profiler localises the regression to
DB::JSONUtils::skipRowForJSONEachRowImpl<'{','}'>(1387 → 2344 samples on a single query): when match cadence is shorter than the vector width, the SIMD body re-scans more bytes than scalar would. This is an inherent vectorisation trade-off and is equally present in the existing SSE2 path on x86. On the realhitsdataset (longer string fields) the JSON regression shrinks to ~6% — and the wins on TSV / URL functions /splitByChar/LineAsStringare 1.3-2.6x.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Vectorise
find_first_symbols,find_first_not_symbols,find_last_symbols_or_null,find_last_not_symbols_or_null, andsplitIntoon AArch64 using NEON. Previously these helpers had a SIMD path only on x86 (SSE2 / SSE4.2) and fell through to a scalar loop on ARM. Speeds up TSV parsing by ~2x, URL functions andsplitByCharby ~1.3x on the ClickBenchhitsdataset; very dense JSON parsing (sub-16-byte field cadence) regresses slightly in line with the existing SSE2 trade-off.Documentation entry for user-facing changes
Version info
26.5.1.660