find_symbols.h: implement AArch64/NEON SIMD path by alexey-milovidov · Pull Request #104228 · ClickHouse/ClickHouse · GitHub
Skip to content

find_symbols.h: implement AArch64/NEON SIMD path#104228

Merged
alexey-milovidov merged 19 commits into
masterfrom
find-symbols-neon
May 15, 2026
Merged

find_symbols.h: implement AArch64/NEON SIMD path#104228
alexey-milovidov merged 19 commits into
masterfrom
find-symbols-neon

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 6, 2026

Copy link
Copy Markdown
Member

On AArch64 the SIMD search loops in base/base/find_symbols.h previously fell through to the scalar byte-by-byte fallback (only __SSE2__ and __SSE4_2__ were handled). This patch adds NEON code paths to all four loops (find_first_symbols_sse2 and find_last_symbols_sse2, both compile-time-template and runtime-needle variants), so functions built on top — find_first_symbols, find_first_not_symbols, find_last_symbols_or_null, find_last_not_symbols_or_null, splitInto, etc. — get vectorised on ARM.

The NEON helpers (neon_is_in) mirror the SSE2 ones (mm_is_in): each needle is broadcast with vdupq_n_u8, compared with vceqq_u8, and OR-folded with vorrq_u8. There is no direct equivalent of pmovmskb on AArch64, so the 16-byte all-0/all-1 result is compressed with the standard vshrn_n_u16(..., 4) trick into a 64-bit value where each input byte occupies a 4-bit nibble; the first/last matching byte is then recovered as __builtin_ctzll(mask) >> 2 / __builtin_clzll(mask) >> 2. Negative search inverts the equality vector with vmvnq_u8 before compression. The same trick is already used in contrib/simdjson, contrib/vectorscan, contrib/SimSIMD, and contrib/StringZilla.

__SSE4_2__ is x86-only, so on AArch64 the dispatch falls through to the _sse2-named variants for any number of symbols 1–16, and the _mm_cmpestr* path is not duplicated.

Verified with a standalone harness that mirrors gtest_find_symbols (including the NullCharacter case) and runs 5000 random fuzz iterations comparing against a scalar reference, plus long-haystack and edge-position cases for both compile-time and runtime needles.

Microbenchmark (function-level, 64MB buffer, lowercase ASCII)

Search Scalar (master) NEON (this PR) Speedup
find_first_symbols<'\t'> 2.78 GB/s 26.19 GB/s 9.4x
find_first_symbols<4 chars> 1.45 GB/s 15.09 GB/s 10.4x
find_first_symbols<8 chars> 1.28 GB/s 8.98 GB/s 7.0x
find_first_symbols(runtime, 4 chars) 1.07 GB/s 13.56 GB/s 12.7x
find_first_symbols(runtime, 8 chars) 0.62 GB/s 7.93 GB/s 12.8x

Real-query performance — ClickBench hits shard (1M rows of real web data: URL, Title, SearchPhrase)

5 iterations, average wall time, AWS Graviton (aarch64). Both binaries built from the same 26.5.1.1 master source; the only difference is whether find_symbols.h has the NEON path.

Workload Scalar (s) NEON (s) Speedup
TSV count() 0.288 0.140 2.06x
extractURLParameter(URL, 'q') 0.238 0.182 1.31x
splitByChar('/', URL) 0.242 0.190 1.27x
TSV output (write 1M rows) 0.280 0.252 1.11x
CSV count() 0.196 0.190 1.03x
JSONEachRow output (write 1M rows) 0.386 0.386 1.00x
JSONEachRow count() 0.310 0.330 0.94x

Real-query performance — synthetic 10M-row workloads (1-1.5GB on disk)

Same setup; chosen to exercise specific call sites with controlled density.

Workload Scalar (s) NEON (s) Speedup
TSV count() 1.15 0.44 2.61x
TSV sum(numeric_col) 1.56 0.85 1.84x
LineAsString count() 0.59 0.29 2.03x
splitByChar(',', s) over 10M strs 1.30 0.70 1.86x
path(url) over 10M URLs 0.88 0.48 1.84x
queryString(url) over 10M URLs 0.89 0.48 1.85x
extractURLParameter(url, 'q') 0.88 0.48 1.85x
JSONEachRow (sparse strings) 0.32 0.29 1.14x
CSV count() (synthetic, dense delims) 0.83 0.93 0.88x
JSONEachRow count() (synthetic, dense) 1.54 2.48 0.62x

Note on the dense-JSONEachRow regression

On the synthetic dataset above the average distance between matches is ~6-8 bytes (every quoted field ends inside a single 16-byte vector chunk). Sampling profiler localises the regression to DB::JSONUtils::skipRowForJSONEachRowImpl<'{','}'> (1387 → 2344 samples on a single query): when match cadence is shorter than the vector width, the SIMD body re-scans more bytes than scalar would. This is an inherent vectorisation trade-off and is equally present in the existing SSE2 path on x86. On the real hits dataset (longer string fields) the JSON regression shrinks to ~6% — and the wins on TSV / URL functions / splitByChar / LineAsString are 1.3-2.6x.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Vectorise find_first_symbols, find_first_not_symbols, find_last_symbols_or_null, find_last_not_symbols_or_null, and splitInto on AArch64 using NEON. Previously these helpers had a SIMD path only on x86 (SSE2 / SSE4.2) and fell through to a scalar loop on ARM. Speeds up TSV parsing by ~2x, URL functions and splitByChar by ~1.3x on the ClickBench hits dataset; very dense JSON parsing (sub-16-byte field cadence) regresses slightly in line with the existing SSE2 trade-off.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.660

Implements NEON code paths in the SIMD search loops in `find_symbols.h`
(used by `find_first_symbols`, `find_first_not_symbols`, `find_last_symbols_or_null`,
`find_last_not_symbols_or_null`, `splitInto`, etc.). On AArch64 the dispatch
previously fell through to a scalar byte-by-byte loop.

The NEON helpers (`neon_is_in`) mirror the SSE2 ones (`mm_is_in`): each search
symbol is broadcast with `vdupq_n_u8` and OR-folded with `vorrq_u8` over per-byte
`vceqq_u8` results. There is no direct equivalent of `pmovmskb` on AArch64, so
the resulting 16-byte all-0/all-1 vector is compressed with the standard
`vshrn_n_u16(..., 4)` trick into a 64-bit value where each input byte occupies
a 4-bit nibble. The first / last matching position is then recovered as
`__builtin_ctzll(mask) >> 2` / `__builtin_clzll(mask) >> 2`. Negative search
inverts the equality vector with `vmvnq_u8` before compression.

Tested with a standalone harness (mirrors `gtest_find_symbols` plus 5000 random
fuzz iterations against a scalar reference, plus long-haystack and null-byte
cases). On a 64MB buffer the NEON path is ~9-13x faster than the scalar
fallback (e.g. single character: 2.78 -> 26.2 GB/s; 8 runtime symbols:
0.62 -> 7.93 GB/s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [06afc7d]

Summary:


AI Review

Summary

This PR adds an AArch64/NEON vectorized path for find_symbols helpers, includes follow-up fixes for runtime-needle correctness/performance edge cases, and adds focused regression coverage for empty runtime needles, embedded \0, and long-haystack reverse runtime paths. After reviewing the current diff and all prior discussion threads/comments, I did not find unresolved correctness, safety, or evidence gaps in the latest PR head.

Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 6, 2026
@Algunenano Algunenano self-assigned this May 6, 2026
@alexey-milovidov alexey-milovidov marked this pull request as ready for review May 6, 2026 14:48

@Algunenano Algunenano left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We could We overload maybe_negate to handle uint8x16_t and keep the same code structure but that's minor.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, fix Logical error: 'phi_squared <= max_phi_squared + 1e-6' (STID: 2508-2e40) and send a separate PR with the fix.

@groeneai

groeneai commented May 6, 2026

Copy link
Copy Markdown
Contributor

@alexey-milovidov — STID 2508-2e40 (phi_squared <= max_phi_squared + 1e-6) is the chronic CrossTab/CramersV state-corruption family already tracked under issue #104183.

We posted a comprehensive analysis there earlier today (2026-05-06): #104183 (comment)

TL;DR:

Opening a separate PR purely for the phi_squared manifestation would splinter what is one underlying bug. Will file the fix under #104183 once the offending combinator is pinpointed. Happy to upstream debug-build invariants in CrossTabCountsState::serialize/merge as a stand-alone diagnostic PR if useful in the meantime.

@groeneai (worker session cron:clickhouse-ci-task-worker:20260506-184500)

alexey-milovidov and others added 2 commits May 6, 2026 21:17
Refactors the NEON paths in find_symbols.h to use a uint8x16_t overload of
maybe_negate, mirroring the structure of the SSE2 paths. No behavior change
(the previous explicit `if constexpr (!positive) eq = vmvnq_u8(eq)` is the
same operation).

#104228 (review)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…k overhead

The previous NEON `find_first_symbols_sse2` runtime-needle path padded unused
needle slots in `neon_is_in_prepare` with `vdupq_n_u8(0)` and unconditionally
iterated all 16 needles in `neon_is_in_execute`. Two issues followed:

1. Correctness: a haystack containing `\0` would falsely match in the SIMD
   body whenever fewer than 16 needles were given, because the padding zeros
   compared equal to NUL bytes. The bug only manifests for haystacks at least
   16 bytes long, so the existing `FindNotSymbols.NullCharacter` test (9-byte
   haystack) did not catch it.
2. Performance: the prepare phase ran 16 `vdupq_n_u8` broadcasts even when
   the haystack was shorter than 16 bytes (so the SIMD body never executed),
   and the execute phase did 16 `vceqq_u8` + 16 `vorrq_u8` per block even
   for tiny needle sets. This caused regressions on `trim_numbers` with
   short decimal strings against a 6-character needle, surfaced by perf CI
   on PR #104228.

The fix:
- `neon_is_in_prepare` only fills `num_chars` slots.
- `neon_is_in_execute` takes `num_chars` and only iterates over real needles.
- The prepare call is moved inside an `if (pos + 15 < end)` guard so it is
  skipped entirely for haystacks too short for the SIMD body.

The extended `FindNotSymbols.NullCharacter` test exercises the SIMD body
with a NUL byte at offset 16 to lock in the correctness fix.
Comment thread base/base/find_symbols.h
alexey-milovidov and others added 3 commits May 7, 2026 18:58
For the runtime-needle variants of `find_first_symbols_sse2` and
`find_last_symbols_sse2`, wrap the SSE2 main loop (and the
`mm_is_in_prepare` call before it) in a guard so we don't pay the
needle-preparation cost when the haystack is shorter than 16 bytes
and the SIMD body would not execute.

Mirrors the same guard already added to the AArch64/NEON path in this
PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread base/base/find_symbols.h Outdated
Before this fix, the AArch64 NEON path was hot for sparse matches (URL
parsers, etc.) but slowed down dense-match callers (CSV/TSV/JSON format
readers, `trim` with custom characters): for each call the SIMD body did
a full vector load + 3-4 `vceqq` + `vorrq` + `shrn` + `ctz`, which costs
more than the handful of byte compares scalar would do when the match is
within the first few bytes. CI perf comparison on PR #104228 surfaced
this as +27% to +74% regressions on `count_from_formats`,
`tsv_csv_nullable_parsing`, `formats_columns_sampling`,
`json_input_format_part_fields`, and `trim_numbers` on
`Performance Comparison (arm_release, master_head, 5/6)`.

The fix is a short scalar pre-check (8 bytes) added before the SIMD
body in all four AArch64 paths (`find_first_symbols_sse2` and
`find_last_symbols_sse2`, both compile-time-template and runtime-needle
variants). Matches that fall in those 8 bytes return without ever
entering the SIMD loop, and the SIMD body still dominates for sparse
matches further into the haystack.

Microbenchmarks (64MB buffer, 10 iterations, AArch64 c8g):

| Pattern | Scalar | NEON (before fix) | NEON (with fix) |
|---|---|---|---|
| CSV-like dense (1-7 byte gaps, 3 needles) | 1350 MB/s | 396 MB/s | 2130 MB/s |
| TSV-like dense (1-7 byte gaps, 2 needles) | 1430 MB/s | 568 MB/s | 2660 MB/s |
| URL-like sparse (~50 byte gaps, 2 needles) | 5950 MB/s | 7470 MB/s | 7000 MB/s |

The 5000-iteration random fuzz harness from the PR description plus
the gtest cases (including `FindNotSymbols.NullCharacter` against the
17-byte haystack that exercises the SIMD body) all pass.

Perf failure on #104228
alexey-milovidov and others added 2 commits May 11, 2026 09:20
`trim_numbers` got 10-22% slower on AArch64 in the CI perf comparison for
`PR=104228` because clang inlined the new NEON body into the per-row callers,
adding extra branches in the short-haystack fast path and (for the runtime
needle variant) reserving a 320-byte stack frame for `std::array<uint8x16_t,
16> needles` even when the SIMD body was skipped.

Move the NEON work into `[[gnu::noinline]]` `find_first_symbols_neon_long`
/ `find_last_symbols_neon_long` (compile-time needles) and
`find_first_symbols_neon_long_rt` / `find_last_symbols_neon_long_rt`
(runtime needles), then tail-call them from the inline dispatcher only when
`end - begin >= 16`. The inline body collapses back to the original scalar
loop the compiler can auto-vectorise/unroll, and the runtime-needle path
no longer reserves stack space for `std::array<uint8x16_t, 16>` per call.

Also drop `neon_is_in_prepare` / `neon_is_in_execute` in favour of the
already-existing `neon_is_in(bytes, symbols, num_chars)` overload: on
AArch64 with clang the by-value `std::array` return forced the spilling.

Local measurements on the same AArch64 instance, `numbers(10000000)` for
each `trim_numbers` query (`master` is the previously installed
ClickHouse binary, `PR-orig` is `b045a07c396` of this branch, `V7` is
this commit):

| query                                      | master | PR-orig | V7   |
| ------------------------------------------ | ------ | ------- | ---- |
| `trim(toString(number))`                   |  103ms |   110ms | 104ms |
| `ltrim(toString(number))`                  |   83ms |   101ms |  98ms |
| `rtrim(toString(number))`                  |   86ms |   104ms |  99ms |
| `trim(LEADING '012345' FROM ...)`          |  139ms |   178ms | 144ms |
| `trim(TRAILING '012345' FROM ...)`         |  168ms |   188ms | 174ms |
| `trim(BOTH '012345' FROM ...)`             |  246ms |   275ms | 229ms |

CI report: #104228

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread base/base/find_symbols.h
alexey-milovidov and others added 2 commits May 12, 2026 09:39
Adds focused regression tests for `find_last_symbols_or_null(string_view,
SearchSymbols)` and `find_last_not_symbols_or_null(string_view,
SearchSymbols)` on haystacks long enough to enter the AArch64 NEON
reverse-search body (`end - begin >= 16`).

Covers the reverse index mapping (`__builtin_clzll(mask) >> 2`) for the
runtime-needle path:
  - 32-byte haystack: matches in the most-recent SIMD chunk, in the
    earlier chunk only, at the chunk boundary, multiple matches, and
    a no-match case.
  - 17-byte haystack: hits the 1-byte scalar tail of the reverse helper,
    the first byte of the haystack, and an interior SIMD-body match.
  - `\0` byte in the haystack: positive find_last must locate it when
    `\0` is in the needle, and must not falsely match it otherwise.
  - Negative variant: prologue boundary, SIMD-body interior match,
    chunk-boundary case, all-needle (must return nullptr), and `\0` as
    a non-needle byte that must be found.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`find_first_symbols_sse2` and `find_last_symbols_sse2` (runtime variants)
call `mm_is_in_prepare`, which value-initialises a `std::array<__m128i, 16>`
and only fills the first `num_chars` slots. The remaining slots stay at
`_mm_setzero_si128()`, i.e. as if they contained the byte `\0`. The
subsequent `mm_is_in_execute` ORs all 16 comparisons, so any `\0` byte
in the haystack spuriously matches via an unused slot even when `\0` is
not in the needle. The bug is observable for `num_chars < 5`; for larger
needles the dispatcher routes to the SSE 4.2 path instead.

Pad the unused slots with a repeat of `result[0]` so the comparison only
ever matches bytes that actually appear in the needle. The setup cost is
amortised across the SIMD loop and the per-chunk hot path is unchanged.

Exposed by the new regression tests added in this PR for
`find_last_symbols_or_null(string_view, SearchSymbols)` and
`find_last_not_symbols_or_null(string_view, SearchSymbols)` against
haystacks containing an embedded `\0`. The corresponding NEON paths are
unaffected because they call `neon_is_in(bytes, symbols, num_chars)`
directly and iterate exactly `num_chars` times.

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104228&sha=726ffb1642471e8ebfe17e1e802ddc1e09490cd2&name_0=PR&name_1=Unit%20tests%20%28asan_ubsan%29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, investigate the chronic Possible deadlock on shutdown (see gdb.log) failure on Stress test (arm_debug): https://s3.amazonaws.com/clickhouse-test-reports/PRs/104228/726ffb1642471e8ebfe17e1e802ddc1e09490cd2/stress_test_arm_debug/gdb.log and provide a fix in a separate PR. This fails 8-10 times per day on master (CIDB) and is unrelated to this PR (the stuck threads are in absl::Mutex shutdown, no find_symbols frames). If a fix is already in progress, link it here.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov investigation results — gdb.log analysis of PRs/104228/726ffb1.../stress_test_arm_debug/gdb.log.

Root cause: leaked local ThreadPool instances at server shutdown

Decoded the gdb.log (725 threads, 7.2MB). The deadlock is not in absl::Mutex shutdown itself — those frames are a side effect. The real blocker is in GlobalThreadPool::shutdown() which can't join its own std::threads because each is running a leaked local pool's worker() loop.

Main thread (LWP 499292)

#0  __futex_abstimed_wait_common64
#3  __pthread_clockjoin_ex                                              ← joining a worker thread
#5  std::thread::join (this=0xff337a7c8ae8)
#6  ThreadPoolImpl<std::thread>::ThreadFromThreadPool::join
#7  ThreadPoolImpl<std::thread>::finalize (this=0xff355d30bc00)        ← GlobalThreadPool::finalize
#8  Server::main::$_15::operator()                       Server.cpp:1482
#9  BasicScopeGuard::invoke
#11 Server::main                                          Server.cpp:3296

Server.cpp:1482 is the last scope guard:

SCOPE_EXIT_SAFE({
    LOG_INFO(log, "Waiting for background threads");
    GlobalThreadPool::instance().shutdown();      // <-- stuck here
    LOG_INFO(log, "Background threads finished in {} ms", watch.elapsedMilliseconds());
});

The thread it's waiting on (LWP 500233 "ThreadPool")

#0  __futex_abstimed_wait_common64
#5  std::condition_variable::wait
#7  std::condition_variable::wait<...lambda#2>(unique_lock, pred)
#8  ThreadPoolImpl<ThreadFromGlobalPoolImpl<false,true>>::ThreadFromThreadPool::worker
                                                          ThreadPool.cpp:736   ← inner pool's wait loop
#15 ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker
                                                          ThreadPool.cpp:809   ← outer std::thread running the job

The std::thread (frame #15, owned by GlobalThreadPool) is running a job which is "be the inner pool's worker forever" (frame #8 at line 736). The inner pool's parent_pool.new_job_or_shutdown.wait(...) only exits when parent_pool.shutdown is set. GlobalThreadPool::shutdown() sets its OWN shutdown, not the inner pool's — those are independent.

Scale of the leak (this run)

56 worker threads stuck at ThreadPool.cpp:736 across 7 distinct leaked pools:

Mutex addr Stuck workers
0xff34114a6b00 16
0xff34114a6800 16
0xff34114a6c80 8
0xff34114a6980 8
0xff355d312e00 3 ← main thread is waiting on a worker of this pool
0xff355d312c80 3
0xff355d312f80 2

All threads are named "ThreadPool" (the default — setThreadName runs inside the worker loop AFTER receiving a job, but these never received one). So the leaked pools can't be identified by thread name alone.

Why this matches the chronic 8-10/day rate

Local ThreadPool (= ThreadPoolImpl<ThreadFromGlobalPool>) instances must be destroyed (which calls finalize() → sets shutdown → notifies) before GlobalThreadPool::shutdown() runs. Some subsystem(s) hold a reference to a local ThreadPool past the scope guard at Server.cpp:1482. Likely candidates: long-lived singletons, registries, or thread-local resources that aren't drained in Server::main cleanup. Different stress runs hit different leakers, which is consistent with this being an umbrella affecting multiple subsystems (8-10/day spread across several specific patterns).

The absl::Mutex frames you saw

gRPC has its own WorkStealingThreadPool. 17 threads in this gdb.log are stuck in gRPC's WorkSignal::WaitWithTimeout (which uses absl::Mutex). They're not the blocker — they're just idle workers of gRPC's pool that haven't been told to exit because gRPC's EventEngine global is presumably one of the things that doesn't clean up before the Server::main scope guard. They're a symptom of the same broader pattern (subsystem lifetimes not properly ordered against GlobalThreadPool shutdown), not an independent deadlock.

Direction for fix

Two angles, complementary:

  1. Targeted (per-subsystem): Identify the leaked local pools. The most practical way is to add a LOG_INFO listing name, num_threads, scheduled_jobs for every ThreadPoolImpl still alive immediately before the GlobalThreadPool::shutdown() call (or in GlobalThreadPool::finalize itself, before notify_all). When chronic hits land in CI, the log will name the leaker(s). Possibly enable behind a debug flag to avoid prod overhead. Then fix the lifetime in each named owner.

  2. Structural: A global registry of live ThreadPoolImpl<ThreadFromGlobalPool> instances; GlobalThreadPool::shutdown() walks the registry, sets shutdown=true and notifies each registered pool's condvar before joining its own std::threads. This breaks the dependency on subsystem destruction order. Slight added cost per pool construction/destruction (register/unregister + lock); modest. Downside: papers over real lifetime bugs.

Recommendation: do (1) first to identify the offenders, then (2) as a backstop if the list of leakers is long-tailed.

Not pushing a fix yet

Per our policy, no fix PR without local reproduction first. Will:

  • Build a debug binary with the diagnostic LOG_INFO patch from (1) and run stress locally to confirm at least one specific leaker
  • Open a fix PR for whichever subsystem(s) show up
  • If the list is long, also open (2) as a backstop

Tracking under task 2026-04-01-ci-hung-check-deadlock-chronic-trunk-is. CIDB rate (14d): 8-10/day on master with a 78-hit spike on May 11 (concentrated in 2 PRs, possibly a single bad run). Will keep this thread updated.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, investigate the failure: 01410_nullable_key_and_index on Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel) and provide a fix in a separate PR. This is unrelated to this PR (the diff is row ordering of a Nullable key under ParallelReplicas, no find_symbols frames). It is flaky on master (CIDB shows 4 hits in the last 30 days across 4 different PRs). If a fix is already in progress, link it here.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, investigate the CH Inc sync failure: the sync PR ClickHouse/clickhouse-private#57269 fails on test_statistics_cache/test_smt.py::test_statistics_cache_smt (Integration tests (amd_asan_ubsan, db disk, old analyzer, 5/6)). This is unrelated to this PR — the diff is entirely under base/base/find_symbols.h (AArch64/NEON path + SSE2 zero-pad fix) and has no contact with the SharedMergeTree statistics cache. Please look into the chronic failure and provide a fix in a separate PR. If a fix is already in progress, link it here.

@groeneai groeneai mentioned this pull request May 13, 2026
Comment thread base/base/find_symbols.h
alexey-milovidov and others added 4 commits May 14, 2026 06:14
Address review feedback on PR #104228: `mm_is_in_prepare` documents the
assumption that `num_chars >= 1`, but `SearchSymbols` still allows an
empty needle. With `num_chars == 0` on the `__SSE2__` runtime path, the
prepared vector remains effectively zero-filled, so `mm_is_in_execute`
would match embedded `\0` bytes in SIMD chunks.

Add an explicit `symbols.str.empty()` fast path to both
`find_first_symbols_dispatch` and `find_last_symbols_dispatch` (the
runtime-needle entry points) so the SIMD body is never invoked with an
empty needle. The fast path matches the scalar semantics: for positive
search nothing is found (returns end/nullptr); for the negative variant
the first (or last, in the find-last case) byte qualifies.

Cover the case with a new `FindSymbols.EmptyRunTimeNeedle` gtest that
exercises short, empty, and long (>= 16 bytes, with embedded `\0`)
haystacks across all six runtime-needle entry points.
Fixes the build failure on `Build (amd_debug)`, `Build (amd_asan_ubsan)`,
`Build (amd_tsan)`, `Build (amd_msan)`, and `Build (llvm_coverage_build)`
introduced by 65e6fb5:

  error: default initialization of an object of const type
  'const SearchSymbols' without a user-provided default constructor

`SearchSymbols` carries a `__m128i simd_vector` member under `__SSE4_2__`
and an explicitly-defaulted `SearchSymbols() = default;` constructor, so
default-initializing a `const` instance leaves `simd_vector` uninitialized
and the compiler rejects the form. Switch the test to value-initialization
(`SearchSymbols empty{}`), which zero-initializes the trivial members and
is well-formed for `const`.

The runtime semantics are unchanged: the new `symbols.str.empty()` fast
path in `find_first_symbols_dispatch` / `find_last_symbols_dispatch`
already bypasses the SIMD body, so the value of `simd_vector` is never
observed for the empty-needle case the test exercises.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104228/65e6fb580a6f05bb6b3822bb8a0902e6f3cccd91/build_amd_debug/build_clickhouse/build_clickhouse.log

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 92.00% 92.00% +0.00%
Branches 76.50% 76.60% +0.10%

Changed lines: 100.00% (109/109) · Uncovered code

Full report · Diff report

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Improvements exactly where we expected:

Screenshot_20260515_014550 Screenshot_20260515_014605 Screenshot_20260515_014625 Screenshot_20260515_014644

@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 14, 2026
Merged via the queue into master with commit c1ccce2 May 15, 2026
167 checks passed
@alexey-milovidov alexey-milovidov deleted the find-symbols-neon branch May 15, 2026 00:08
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 15, 2026
@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov, investigation of the test_statistics_cache_smt failure on CH Inc sync PR clickhouse-private#57269.

What we can see on the public side

test_statistics_cache (the public MergeTree sibling — tests/integration/test_statistics_cache/test.py) has 0 master failures in the last 60 days. CIDB query:

SELECT countIf(pull_request_number=0) AS master, countIf(pull_request_number>0) AS prs
FROM default.checks
WHERE test_name LIKE '%test_statistics_cache%'
  AND test_status IN ('FAIL','ERROR')
  AND check_start_time > now() - INTERVAL 60 DAY
-- master=0, prs=1048, distinct_prs=15

All 1048 PR failures are concentrated on 2 PRs, both unmerged:

PR Author Sub-tests failing Last seen
#99736 Optimize AND comparison chains @wudidapaopao 6 / 13 (load, session_toggle, interval_zero, alter_interval, countmin_lc, auto_types) 2026-05-09
#100185 Add assertions in IColumn::mutate to verify deep unique ownership of sub-columns @alexey-milovidov 13 / 13 (all) 2026-05-06

Notably your own PR #100185 makes every test_statistics_cache sub-test fail — the assertions you added fire deep inside the cache's statistics-loading path. PR #100185 touches ColumnReplicated.{cpp,h} among many other column types, which is the column used by SMT replication. Likely the same latent issue your assertions are designed to surface (column shared-ownership violation during IColumn::mutate) also exists in code reachable from the SMT statistics-load path.

What we can't see

The failing report URL is https://d1k2gkhrlfqv31.cloudfront.net/clickhouse-test-reports-private/... which returns HTTP 401 (Basic auth required) — we can't read the private CI artifact. The test file test_statistics_cache/test_smt.py is also in the private repo only (public has just test.py).

The shared code path

The LoadedStatisticsMicroseconds ProfileEvent that drives _assert_load lives in MergeTreeData::getConditionSelectivityEstimatorByPredicate (src/Storages/MergeTree/MergeTreeData.cpp:864). MergeTreeData is the public base for both StorageMergeTree and StorageSharedMergeTree, so the cache (cached_estimator, stats_mutex, refreshStatistics) is shared infrastructure. SMT-only divergence will be in:

  • IMergeTreeDataPart::loadStatistics() overrides on the SMT-side DataPartShared/equivalent
  • the part-discovery code that feeds getDataPartsVectorForInternalUsage under SMT

Recent public master commits touching this path that landed after the last green state and could be the sync regressor:

  • 78d10eb43cd statistics retry after failure (@zoomxi, 2026-05-08) — restructures IMergeTreeDataPart::getEstimates to build a local new_estimates and only move on success; adds merge_tree_load_statistics_throw fail point. If the SMT override of loadStatistics throws once and the new retry contract isn't honored, the assertion _loaded_stats_us > 0 will silently fail.
  • 801397d3c11 / 65ad40dfe08 / e9f27104b27null_count part-pruning + auto-statistics randomization (@zoomxi, PR Add NullCount statistics support for part pruning #104214). Changes how parts are pruned before statistics loading.

What we need from you

Without the private CI artifact we're guessing. To root-cause this we need any of:

  1. The exact failing assertion line + stack trace from test_statistics_cache_smt (paste it in the comment).
  2. The LoadedStatisticsMicroseconds value the failing query reports (0? or non-zero but cached unexpectedly?).
  3. Confirmation of whether the sync PR includes commit 78d10eb43cd and/or 801397d3c11.
  4. Read access to the private report bucket, or a re-publish of the failing test output as a public gist.

If (1) shows the failure is on the same _assert_load line and _loaded_stats_us == 0, the most likely public-side suspect is the new retry contract in 78d10eb43cd not propagating through the SMT-side part override. We'd then fix it on master and request a re-sync.

groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 22, 2026
PR ClickHouse#104228 added a NEON SIMD path to find_symbols.h. On AArch64 the inline
dispatcher tail-calls an out-of-line ([[gnu::noinline]]) helper whenever
end - begin >= 16; that helper ran an 8-byte scalar prefix and then the NEON
loop.

That guard is the wrong signal for format readers (CSV/TSV/JSON). They pass a
haystack that ends at the far read-buffer end, but whose delimiter/quote is
within the first few bytes of the current field. end - begin is large, so
every short field pays a branch to out-of-line code where master used a tiny
inline scalar loop. This is the regression reported in ClickHouse#106605 for the
tsv_csv_nullable_parsing query 1 perf test on ARM (CSV Nullable(String),
1M short rows): about +18..30% on Neoverse-V2.

Move the 8-byte scalar prefix out of the out-of-line helper and into the
inline dispatcher, then tail-call the helper (now the NEON loop only) at
begin + 8 (forward) / end - 8 (backward). Short fields whose delimiter is in
the first 8 bytes now resolve fully inline with no SIMD call and no repeated
work, while sparse long-haystack scans reach the NEON loop unchanged. The
[[gnu::noinline]] attribute now guards only the SIMD loop, which is its
intended purpose: keeping the vector loop out of short-string callers such as
trim. Applied to all four AArch64 dispatch sites (find_first / find_last,
compile-time and runtime needles). x86 paths are unchanged.

Validated by cross-compiling for aarch64 (clang-21) and running the NEON code
under qemu-aarch64: 2.4M randomized/exhaustive correctness checks against a
scalar reference with 0 mismatches, and the existing gtest_find_symbols suite
passes. Microbenchmarks behind a real call site show the short-field CSV case
beats master at every delimiter offset 0..16 while long-haystack throughput is
unchanged (1.13 GB/s, identical to master).

Closes: ClickHouse#106605

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 22, 2026
The trim_numbers performance test query 1 (ltrim(toString(number))) regressed
~12-14% on ARM/aarch64 after the NEON rewrite of find_first_not_symbols
(PR ClickHouse#104228). Decimal strings have no leading whitespace, so default trimLeft
is a no-op that still calls the per-row symbol scan and copies the whole column
into a fresh ColumnString.

Detect this case cheaply: default left-trim removes only leading spaces, so a
row is unchanged iff its first byte is not a space. When no row would change,
return the input column directly and skip the per-row scan and the full copy.
This sidesteps the regressed code path entirely for the common case and avoids
materializing an identical column.

Other paths (trimRight, trimBoth, custom trim characters, FixedString) are
unchanged. The mixed-column case (some row starts with a space) falls back to
the normal path, so leading spaces are still removed.

Closes: ClickHouse#106601

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 24, 2026
The trim_numbers performance test query 1 (ltrim(toString(number)))
regressed ~12-14% on ARM/aarch64 after the NEON rewrite of
find_first_not_symbols (PR ClickHouse#104228). Decimal strings have no leading
whitespace, so default trimLeft still paid the per-row SIMD scan setup
for a no-op.

Restructure FunctionTrim into three separate non-template paths
(vectorSpace, vectorCustom, vectorFixed) instead of the
template-with-if-constexpr machinery plus executeImpl. The default
(space) path now advances begin/end with a scalar while loop comparing
single bytes, which beats the SIMD scan for short strings and exits
after one comparison when there is no leading space. Output capacity is
reserved once and the size is set after the loop, removing the per-row
resize. trim_left/trim_right are cheap runtime ifs in the loop body.

Custom trim character sets keep O(1) table lookups; FixedString is
handled by a single non-template function. Results are unchanged for all
inputs (verified against the regex baseline and under ASan/UBSan).

Closes: ClickHouse#106601

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants