iframe-proxy

alexey-milovidov · 2026-05-06T13:59:30Z

On AArch64 the SIMD search loops in base/base/find_symbols.h previously fell through to the scalar byte-by-byte fallback (only __SSE2__ and __SSE4_2__ were handled). This patch adds NEON code paths to all four loops (find_first_symbols_sse2 and find_last_symbols_sse2, both compile-time-template and runtime-needle variants), so functions built on top — find_first_symbols, find_first_not_symbols, find_last_symbols_or_null, find_last_not_symbols_or_null, splitInto, etc. — get vectorised on ARM.

The NEON helpers (neon_is_in) mirror the SSE2 ones (mm_is_in): each needle is broadcast with vdupq_n_u8, compared with vceqq_u8, and OR-folded with vorrq_u8. There is no direct equivalent of pmovmskb on AArch64, so the 16-byte all-0/all-1 result is compressed with the standard vshrn_n_u16(..., 4) trick into a 64-bit value where each input byte occupies a 4-bit nibble; the first/last matching byte is then recovered as __builtin_ctzll(mask) >> 2 / __builtin_clzll(mask) >> 2. Negative search inverts the equality vector with vmvnq_u8 before compression. The same trick is already used in contrib/simdjson, contrib/vectorscan, contrib/SimSIMD, and contrib/StringZilla.

__SSE4_2__ is x86-only, so on AArch64 the dispatch falls through to the _sse2-named variants for any number of symbols 1–16, and the _mm_cmpestr* path is not duplicated.

Verified with a standalone harness that mirrors gtest_find_symbols (including the NullCharacter case) and runs 5000 random fuzz iterations comparing against a scalar reference, plus long-haystack and edge-position cases for both compile-time and runtime needles.

Microbenchmark (function-level, 64MB buffer, lowercase ASCII)

Real-query performance — ClickBench `hits` shard (1M rows of real web data: URL, Title, SearchPhrase)

5 iterations, average wall time, AWS Graviton (aarch64). Both binaries built from the same 26.5.1.1 master source; the only difference is whether find_symbols.h has the NEON path.

Workload	Scalar (s)	NEON (s)	Speedup
`TSV` `count()`	0.288	0.140	2.06x
`extractURLParameter(URL, 'q')`	0.238	0.182	1.31x
`splitByChar('/', URL)`	0.242	0.190	1.27x
`TSV` output (write 1M rows)	0.280	0.252	1.11x
`CSV` `count()`	0.196	0.190	1.03x
`JSONEachRow` output (write 1M rows)	0.386	0.386	1.00x
`JSONEachRow` `count()`	0.310	0.330	0.94x

Real-query performance — synthetic 10M-row workloads (1-1.5GB on disk)

Same setup; chosen to exercise specific call sites with controlled density.

Workload	Scalar (s)	NEON (s)	Speedup
`TSV` `count()`	1.15	0.44	2.61x
`TSV` `sum(numeric_col)`	1.56	0.85	1.84x
`LineAsString` `count()`	0.59	0.29	2.03x
`splitByChar(',', s)` over 10M strs	1.30	0.70	1.86x
`path(url)` over 10M URLs	0.88	0.48	1.84x
`queryString(url)` over 10M URLs	0.89	0.48	1.85x
`extractURLParameter(url, 'q')`	0.88	0.48	1.85x
`JSONEachRow` (sparse strings)	0.32	0.29	1.14x
`CSV` `count()` (synthetic, dense delims)	0.83	0.93	0.88x
`JSONEachRow` `count()` (synthetic, dense)	1.54	2.48	0.62x

Note on the dense-`JSONEachRow` regression

On the synthetic dataset above the average distance between matches is ~6-8 bytes (every quoted field ends inside a single 16-byte vector chunk). Sampling profiler localises the regression to DB::JSONUtils::skipRowForJSONEachRowImpl<'{','}'> (1387 → 2344 samples on a single query): when match cadence is shorter than the vector width, the SIMD body re-scans more bytes than scalar would. This is an inherent vectorisation trade-off and is equally present in the existing SSE2 path on x86. On the real hits dataset (longer string fields) the JSON regression shrinks to ~6% — and the wins on TSV / URL functions / splitByChar / LineAsString are 1.3-2.6x.

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Vectorise find_first_symbols, find_first_not_symbols, find_last_symbols_or_null, find_last_not_symbols_or_null, and splitInto on AArch64 using NEON. Previously these helpers had a SIMD path only on x86 (SSE2 / SSE4.2) and fell through to a scalar loop on ARM. Speeds up TSV parsing by ~2x, URL functions and splitByChar by ~1.3x on the ClickBench hits dataset; very dense JSON parsing (sub-16-byte field cadence) regresses slightly in line with the existing SSE2 trade-off.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Version info

Merged into: 26.5.1.660

Implements NEON code paths in the SIMD search loops in `find_symbols.h` (used by `find_first_symbols`, `find_first_not_symbols`, `find_last_symbols_or_null`, `find_last_not_symbols_or_null`, `splitInto`, etc.). On AArch64 the dispatch previously fell through to a scalar byte-by-byte loop. The NEON helpers (`neon_is_in`) mirror the SSE2 ones (`mm_is_in`): each search symbol is broadcast with `vdupq_n_u8` and OR-folded with `vorrq_u8` over per-byte `vceqq_u8` results. There is no direct equivalent of `pmovmskb` on AArch64, so the resulting 16-byte all-0/all-1 vector is compressed with the standard `vshrn_n_u16(..., 4)` trick into a 64-bit value where each input byte occupies a 4-bit nibble. The first / last matching position is then recovered as `__builtin_ctzll(mask) >> 2` / `__builtin_clzll(mask) >> 2`. Negative search inverts the equality vector with `vmvnq_u8` before compression. Tested with a standalone harness (mirrors `gtest_find_symbols` plus 5000 random fuzz iterations against a scalar reference, plus long-haystack and null-byte cases). On a 64MB buffer the NEON path is ~9-13x faster than the scalar fallback (e.g. single character: 2.78 -> 26.2 GB/s; 8 runtime symbols: 0.62 -> 7.93 GB/s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-06T14:00:10Z

Workflow [PR], commit [06afc7d]

Summary: ✅

AI Review

Summary

This PR adds an AArch64/NEON vectorized path for find_symbols helpers, includes follow-up fixes for runtime-needle correctness/performance edge cases, and adds focused regression coverage for empty runtime needles, embedded \0, and long-haystack reverse runtime paths. After reviewing the current diff and all prior discussion threads/comments, I did not find unresolved correctness, safety, or evidence gaps in the latest PR head.

Final Verdict

Status: ✅ Approve

Algunenano

Looks good. We could We overload maybe_negate to handle uint8x16_t and keep the same code structure but that's minor.

alexey-milovidov · 2026-05-06T18:07:48Z

@groeneai, fix Logical error: 'phi_squared <= max_phi_squared + 1e-6' (STID: 2508-2e40) and send a separate PR with the fix.

groeneai · 2026-05-06T18:49:04Z

@alexey-milovidov — STID 2508-2e40 (phi_squared <= max_phi_squared + 1e-6) is the chronic CrossTab/CramersV state-corruption family already tracked under issue #104183.

We posted a comprehensive analysis there earlier today (2026-05-06): #104183 (comment)

TL;DR:

5 distinct stack signatures observed across 13 days, all dating from after PR Optimize cramersV, cramersVBiasCorrected, theilsU, and contingency when used with window functions #93384 (2026-04-23)
The phi_squared assertion is one of those signatures; sibling signatures hit HashTable::write and HashTable::begin from CrossTabCountsState::serialize and CrossTabCountsState::merge
Root cause hypothesis: a missing getStateVariant / canMergeStateFromDifferentVariant combinator override, or a sizeOfData mismatch between CrossTabAggregateData (GROUP BY state) and CrossTabPhiSquaredWindowData (window state) — leading to placement-new in a wrong-sized buffer and a torn HashMap that later yields a 0xa faulting address or an out-of-range phi_squared
Cannot reproduce locally without the specific concurrent stress conditions; @nihalzp (PR Optimize cramersV, cramersVBiasCorrected, theilsU, and contingency when used with window functions #93384 author) was pinged for a combinator audit

Opening a separate PR purely for the phi_squared manifestation would splinter what is one underlying bug. Will file the fix under #104183 once the offending combinator is pinpointed. Happy to upstream debug-build invariants in CrossTabCountsState::serialize/merge as a stand-alone diagnostic PR if useful in the meantime.

— @groeneai (worker session cron:clickhouse-ci-task-worker:20260506-184500)

Refactors the NEON paths in find_symbols.h to use a uint8x16_t overload of maybe_negate, mirroring the structure of the SSE2 paths. No behavior change (the previous explicit `if constexpr (!positive) eq = vmvnq_u8(eq)` is the same operation). #104228 (review) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…k overhead The previous NEON `find_first_symbols_sse2` runtime-needle path padded unused needle slots in `neon_is_in_prepare` with `vdupq_n_u8(0)` and unconditionally iterated all 16 needles in `neon_is_in_execute`. Two issues followed: 1. Correctness: a haystack containing `\0` would falsely match in the SIMD body whenever fewer than 16 needles were given, because the padding zeros compared equal to NUL bytes. The bug only manifests for haystacks at least 16 bytes long, so the existing `FindNotSymbols.NullCharacter` test (9-byte haystack) did not catch it. 2. Performance: the prepare phase ran 16 `vdupq_n_u8` broadcasts even when the haystack was shorter than 16 bytes (so the SIMD body never executed), and the execute phase did 16 `vceqq_u8` + 16 `vorrq_u8` per block even for tiny needle sets. This caused regressions on `trim_numbers` with short decimal strings against a 6-character needle, surfaced by perf CI on PR #104228. The fix: - `neon_is_in_prepare` only fills `num_chars` slots. - `neon_is_in_execute` takes `num_chars` and only iterates over real needles. - The prepare call is moved inside an `if (pos + 15 < end)` guard so it is skipped entirely for haystacks too short for the SIMD body. The extended `FindNotSymbols.NullCharacter` test exercises the SIMD body with a NUL byte at offset 16 to lock in the correctness fix.

For the runtime-needle variants of `find_first_symbols_sse2` and `find_last_symbols_sse2`, wrap the SSE2 main loop (and the `mm_is_in_prepare` call before it) in a guard so we don't pay the needle-preparation cost when the haystack is shorter than 16 bytes and the SIMD body would not execute. Mirrors the same guard already added to the AArch64/NEON path in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Before this fix, the AArch64 NEON path was hot for sparse matches (URL parsers, etc.) but slowed down dense-match callers (CSV/TSV/JSON format readers, `trim` with custom characters): for each call the SIMD body did a full vector load + 3-4 `vceqq` + `vorrq` + `shrn` + `ctz`, which costs more than the handful of byte compares scalar would do when the match is within the first few bytes. CI perf comparison on PR #104228 surfaced this as +27% to +74% regressions on `count_from_formats`, `tsv_csv_nullable_parsing`, `formats_columns_sampling`, `json_input_format_part_fields`, and `trim_numbers` on `Performance Comparison (arm_release, master_head, 5/6)`. The fix is a short scalar pre-check (8 bytes) added before the SIMD body in all four AArch64 paths (`find_first_symbols_sse2` and `find_last_symbols_sse2`, both compile-time-template and runtime-needle variants). Matches that fall in those 8 bytes return without ever entering the SIMD loop, and the SIMD body still dominates for sparse matches further into the haystack. Microbenchmarks (64MB buffer, 10 iterations, AArch64 c8g): | Pattern | Scalar | NEON (before fix) | NEON (with fix) | |---|---|---|---| | CSV-like dense (1-7 byte gaps, 3 needles) | 1350 MB/s | 396 MB/s | 2130 MB/s | | TSV-like dense (1-7 byte gaps, 2 needles) | 1430 MB/s | 568 MB/s | 2660 MB/s | | URL-like sparse (~50 byte gaps, 2 needles) | 5950 MB/s | 7470 MB/s | 7000 MB/s | The 5000-iteration random fuzz harness from the PR description plus the gtest cases (including `FindNotSymbols.NullCharacter` against the 17-byte haystack that exercises the SIMD body) all pass. Perf failure on #104228

`trim_numbers` got 10-22% slower on AArch64 in the CI perf comparison for `PR=104228` because clang inlined the new NEON body into the per-row callers, adding extra branches in the short-haystack fast path and (for the runtime needle variant) reserving a 320-byte stack frame for `std::array<uint8x16_t, 16> needles` even when the SIMD body was skipped. Move the NEON work into `[[gnu::noinline]]` `find_first_symbols_neon_long` / `find_last_symbols_neon_long` (compile-time needles) and `find_first_symbols_neon_long_rt` / `find_last_symbols_neon_long_rt` (runtime needles), then tail-call them from the inline dispatcher only when `end - begin >= 16`. The inline body collapses back to the original scalar loop the compiler can auto-vectorise/unroll, and the runtime-needle path no longer reserves stack space for `std::array<uint8x16_t, 16>` per call. Also drop `neon_is_in_prepare` / `neon_is_in_execute` in favour of the already-existing `neon_is_in(bytes, symbols, num_chars)` overload: on AArch64 with clang the by-value `std::array` return forced the spilling. Local measurements on the same AArch64 instance, `numbers(10000000)` for each `trim_numbers` query (`master` is the previously installed ClickHouse binary, `PR-orig` is `b045a07c396` of this branch, `V7` is this commit): | query | master | PR-orig | V7 | | ------------------------------------------ | ------ | ------- | ---- | | `trim(toString(number))` | 103ms | 110ms | 104ms | | `ltrim(toString(number))` | 83ms | 101ms | 98ms | | `rtrim(toString(number))` | 86ms | 104ms | 99ms | | `trim(LEADING '012345' FROM ...)` | 139ms | 178ms | 144ms | | `trim(TRAILING '012345' FROM ...)` | 168ms | 188ms | 174ms | | `trim(BOTH '012345' FROM ...)` | 246ms | 275ms | 229ms | CI report: #104228 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds focused regression tests for `find_last_symbols_or_null(string_view, SearchSymbols)` and `find_last_not_symbols_or_null(string_view, SearchSymbols)` on haystacks long enough to enter the AArch64 NEON reverse-search body (`end - begin >= 16`). Covers the reverse index mapping (`__builtin_clzll(mask) >> 2`) for the runtime-needle path: - 32-byte haystack: matches in the most-recent SIMD chunk, in the earlier chunk only, at the chunk boundary, multiple matches, and a no-match case. - 17-byte haystack: hits the 1-byte scalar tail of the reverse helper, the first byte of the haystack, and an interior SIMD-body match. - `\0` byte in the haystack: positive find_last must locate it when `\0` is in the needle, and must not falsely match it otherwise. - Negative variant: prologue boundary, SIMD-body interior match, chunk-boundary case, all-needle (must return nullptr), and `\0` as a non-needle byte that must be found. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`find_first_symbols_sse2` and `find_last_symbols_sse2` (runtime variants) call `mm_is_in_prepare`, which value-initialises a `std::array<__m128i, 16>` and only fills the first `num_chars` slots. The remaining slots stay at `_mm_setzero_si128()`, i.e. as if they contained the byte `\0`. The subsequent `mm_is_in_execute` ORs all 16 comparisons, so any `\0` byte in the haystack spuriously matches via an unused slot even when `\0` is not in the needle. The bug is observable for `num_chars < 5`; for larger needles the dispatcher routes to the SSE 4.2 path instead. Pad the unused slots with a repeat of `result[0]` so the comparison only ever matches bytes that actually appear in the needle. The setup cost is amortised across the SIMD loop and the per-chunk hot path is unchanged. Exposed by the new regression tests added in this PR for `find_last_symbols_or_null(string_view, SearchSymbols)` and `find_last_not_symbols_or_null(string_view, SearchSymbols)` against haystacks containing an embedded `\0`. The corresponding NEON paths are unaffected because they call `neon_is_in(bytes, symbols, num_chars)` directly and iterate exactly `num_chars` times. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104228&sha=726ffb1642471e8ebfe17e1e802ddc1e09490cd2&name_0=PR&name_1=Unit%20tests%20%28asan_ubsan%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-05-12T23:43:03Z

@groeneai, investigate the chronic Possible deadlock on shutdown (see gdb.log) failure on Stress test (arm_debug): https://s3.amazonaws.com/clickhouse-test-reports/PRs/104228/726ffb1642471e8ebfe17e1e802ddc1e09490cd2/stress_test_arm_debug/gdb.log and provide a fix in a separate PR. This fails 8-10 times per day on master (CIDB) and is unrelated to this PR (the stuck threads are in absl::Mutex shutdown, no find_symbols frames). If a fix is already in progress, link it here.

groeneai · 2026-05-13T05:50:18Z

@alexey-milovidov investigation results — gdb.log analysis of PRs/104228/726ffb1.../stress_test_arm_debug/gdb.log.

Root cause: leaked local `ThreadPool` instances at server shutdown

Decoded the gdb.log (725 threads, 7.2MB). The deadlock is not in absl::Mutex shutdown itself — those frames are a side effect. The real blocker is in GlobalThreadPool::shutdown() which can't join its own std::threads because each is running a leaked local pool's worker() loop.

Main thread (LWP 499292)

#0  __futex_abstimed_wait_common64
#3  __pthread_clockjoin_ex                                              ← joining a worker thread
#5  std::thread::join (this=0xff337a7c8ae8)
#6  ThreadPoolImpl<std::thread>::ThreadFromThreadPool::join
#7  ThreadPoolImpl<std::thread>::finalize (this=0xff355d30bc00)        ← GlobalThreadPool::finalize
#8  Server::main::$_15::operator()                       Server.cpp:1482
#9  BasicScopeGuard::invoke
#11 Server::main                                          Server.cpp:3296

Server.cpp:1482 is the last scope guard:

SCOPE_EXIT_SAFE({
    LOG_INFO(log, "Waiting for background threads");
    GlobalThreadPool::instance().shutdown();      // <-- stuck here
    LOG_INFO(log, "Background threads finished in {} ms", watch.elapsedMilliseconds());
});

The thread it's waiting on (LWP 500233 "ThreadPool")

#0  __futex_abstimed_wait_common64
#5  std::condition_variable::wait
#7  std::condition_variable::wait<...lambda#2>(unique_lock, pred)
#8  ThreadPoolImpl<ThreadFromGlobalPoolImpl<false,true>>::ThreadFromThreadPool::worker
                                                          ThreadPool.cpp:736   ← inner pool's wait loop
#15 ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker
                                                          ThreadPool.cpp:809   ← outer std::thread running the job

The std::thread (frame #15, owned by GlobalThreadPool) is running a job which is "be the inner pool's worker forever" (frame #8 at line 736). The inner pool's parent_pool.new_job_or_shutdown.wait(...) only exits when parent_pool.shutdown is set. GlobalThreadPool::shutdown() sets its OWN shutdown, not the inner pool's — those are independent.

Scale of the leak (this run)

56 worker threads stuck at ThreadPool.cpp:736 across 7 distinct leaked pools:

Mutex addr	Stuck workers
0xff34114a6b00	16
0xff34114a6800	16
0xff34114a6c80	8
0xff34114a6980	8
0xff355d312e00	3 ← main thread is waiting on a worker of this pool
0xff355d312c80	3
0xff355d312f80	2

All threads are named "ThreadPool" (the default — setThreadName runs inside the worker loop AFTER receiving a job, but these never received one). So the leaked pools can't be identified by thread name alone.

Why this matches the chronic 8-10/day rate

Local ThreadPool (= ThreadPoolImpl<ThreadFromGlobalPool>) instances must be destroyed (which calls finalize() → sets shutdown → notifies) before GlobalThreadPool::shutdown() runs. Some subsystem(s) hold a reference to a local ThreadPool past the scope guard at Server.cpp:1482. Likely candidates: long-lived singletons, registries, or thread-local resources that aren't drained in Server::main cleanup. Different stress runs hit different leakers, which is consistent with this being an umbrella affecting multiple subsystems (8-10/day spread across several specific patterns).

The `absl::Mutex` frames you saw

gRPC has its own WorkStealingThreadPool. 17 threads in this gdb.log are stuck in gRPC's WorkSignal::WaitWithTimeout (which uses absl::Mutex). They're not the blocker — they're just idle workers of gRPC's pool that haven't been told to exit because gRPC's EventEngine global is presumably one of the things that doesn't clean up before the Server::main scope guard. They're a symptom of the same broader pattern (subsystem lifetimes not properly ordered against GlobalThreadPool shutdown), not an independent deadlock.

Direction for fix

Two angles, complementary:

Targeted (per-subsystem): Identify the leaked local pools. The most practical way is to add a LOG_INFO listing name, num_threads, scheduled_jobs for every ThreadPoolImpl still alive immediately before the GlobalThreadPool::shutdown() call (or in GlobalThreadPool::finalize itself, before notify_all). When chronic hits land in CI, the log will name the leaker(s). Possibly enable behind a debug flag to avoid prod overhead. Then fix the lifetime in each named owner.
Structural: A global registry of live ThreadPoolImpl<ThreadFromGlobalPool> instances; GlobalThreadPool::shutdown() walks the registry, sets shutdown=true and notifies each registered pool's condvar before joining its own std::threads. This breaks the dependency on subsystem destruction order. Slight added cost per pool construction/destruction (register/unregister + lock); modest. Downside: papers over real lifetime bugs.

Recommendation: do (1) first to identify the offenders, then (2) as a backstop if the list of leakers is long-tailed.

Not pushing a fix yet

Per our policy, no fix PR without local reproduction first. Will:

Build a debug binary with the diagnostic LOG_INFO patch from (1) and run stress locally to confirm at least one specific leaker
Open a fix PR for whichever subsystem(s) show up
If the list is long, also open (2) as a backstop

Tracking under task 2026-04-01-ci-hung-check-deadlock-chronic-trunk-is. CIDB rate (14d): 8-10/day on master with a 78-hit spike on May 11 (concentrated in 2 PRs, possibly a single bad run). Will keep this thread updated.

alexey-milovidov · 2026-05-13T09:34:24Z

@groeneai, investigate the failure: 01410_nullable_key_and_index on Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel) and provide a fix in a separate PR. This is unrelated to this PR (the diff is row ordering of a Nullable key under ParallelReplicas, no find_symbols frames). It is flaky on master (CIDB shows 4 hits in the last 30 days across 4 different PRs). If a fix is already in progress, link it here.

alexey-milovidov · 2026-05-13T17:10:42Z

@groeneai, investigate the CH Inc sync failure: the sync PR ClickHouse/clickhouse-private#57269 fails on test_statistics_cache/test_smt.py::test_statistics_cache_smt (Integration tests (amd_asan_ubsan, db disk, old analyzer, 5/6)). This is unrelated to this PR — the diff is entirely under base/base/find_symbols.h (AArch64/NEON path + SSE2 zero-pad fix) and has no contact with the SharedMergeTree statistics cache. Please look into the chronic failure and provide a fix in a separate PR. If a fix is already in progress, link it here.

Address review feedback on PR #104228: `mm_is_in_prepare` documents the assumption that `num_chars >= 1`, but `SearchSymbols` still allows an empty needle. With `num_chars == 0` on the `__SSE2__` runtime path, the prepared vector remains effectively zero-filled, so `mm_is_in_execute` would match embedded `\0` bytes in SIMD chunks. Add an explicit `symbols.str.empty()` fast path to both `find_first_symbols_dispatch` and `find_last_symbols_dispatch` (the runtime-needle entry points) so the SIMD body is never invoked with an empty needle. The fast path matches the scalar semantics: for positive search nothing is found (returns end/nullptr); for the negative variant the first (or last, in the find-last case) byte qualifies. Cover the case with a new `FindSymbols.EmptyRunTimeNeedle` gtest that exercises short, empty, and long (>= 16 bytes, with embedded `\0`) haystacks across all six runtime-needle entry points.

Fixes the build failure on `Build (amd_debug)`, `Build (amd_asan_ubsan)`, `Build (amd_tsan)`, `Build (amd_msan)`, and `Build (llvm_coverage_build)` introduced by 65e6fb5: error: default initialization of an object of const type 'const SearchSymbols' without a user-provided default constructor `SearchSymbols` carries a `__m128i simd_vector` member under `__SSE4_2__` and an explicitly-defaulted `SearchSymbols() = default;` constructor, so default-initializing a `const` instance leaves `simd_vector` uninitialized and the compiler rejects the form. Switch the test to value-initialization (`SearchSymbols empty{}`), which zero-initializes the trivial members and is well-formed for `const`. The runtime semantics are unchanged: the new `symbols.str.empty()` fast path in `find_first_symbols_dispatch` / `find_last_symbols_dispatch` already bypasses the SIMD body, so the value of `simd_vector` is never observed for the empty-needle case the test exercises. CI report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104228/65e6fb580a6f05bb6b3822bb8a0902e6f3cccd91/build_amd_debug/build_clickhouse/build_clickhouse.log Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-14T19:39:09Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.10%	84.10%	+0.00%
Functions	92.00%	92.00%	+0.00%
Branches	76.50%	76.60%	+0.10%

Changed lines: 100.00% (109/109) · Uncovered code

Full report · Diff report

alexey-milovidov · 2026-05-14T23:47:22Z

Improvements exactly where we expected:

groeneai · 2026-05-15T18:51:09Z

@alexey-milovidov, investigation of the test_statistics_cache_smt failure on CH Inc sync PR clickhouse-private#57269.

What we can see on the public side

test_statistics_cache (the public MergeTree sibling — tests/integration/test_statistics_cache/test.py) has 0 master failures in the last 60 days. CIDB query:

SELECT countIf(pull_request_number=0) AS master, countIf(pull_request_number>0) AS prs
FROM default.checks
WHERE test_name LIKE '%test_statistics_cache%'
  AND test_status IN ('FAIL','ERROR')
  AND check_start_time > now() - INTERVAL 60 DAY
-- master=0, prs=1048, distinct_prs=15

All 1048 PR failures are concentrated on 2 PRs, both unmerged:

Notably your own PR #100185 makes every test_statistics_cache sub-test fail — the assertions you added fire deep inside the cache's statistics-loading path. PR #100185 touches ColumnReplicated.{cpp,h} among many other column types, which is the column used by SMT replication. Likely the same latent issue your assertions are designed to surface (column shared-ownership violation during IColumn::mutate) also exists in code reachable from the SMT statistics-load path.

What we can't see

The failing report URL is https://d1k2gkhrlfqv31.cloudfront.net/clickhouse-test-reports-private/... which returns HTTP 401 (Basic auth required) — we can't read the private CI artifact. The test file test_statistics_cache/test_smt.py is also in the private repo only (public has just test.py).

The shared code path

The LoadedStatisticsMicroseconds ProfileEvent that drives _assert_load lives in MergeTreeData::getConditionSelectivityEstimatorByPredicate (src/Storages/MergeTree/MergeTreeData.cpp:864). MergeTreeData is the public base for both StorageMergeTree and StorageSharedMergeTree, so the cache (cached_estimator, stats_mutex, refreshStatistics) is shared infrastructure. SMT-only divergence will be in:

IMergeTreeDataPart::loadStatistics() overrides on the SMT-side DataPartShared/equivalent
the part-discovery code that feeds getDataPartsVectorForInternalUsage under SMT

Recent public master commits touching this path that landed after the last green state and could be the sync regressor:

78d10eb43cd statistics retry after failure (@zoomxi, 2026-05-08) — restructures IMergeTreeDataPart::getEstimates to build a local new_estimates and only move on success; adds merge_tree_load_statistics_throw fail point. If the SMT override of loadStatistics throws once and the new retry contract isn't honored, the assertion _loaded_stats_us > 0 will silently fail.
801397d3c11 / 65ad40dfe08 / e9f27104b27 — null_count part-pruning + auto-statistics randomization (@zoomxi, PR Add NullCount statistics support for part pruning #104214). Changes how parts are pruned before statistics loading.

What we need from you

Without the private CI artifact we're guessing. To root-cause this we need any of:

The exact failing assertion line + stack trace from test_statistics_cache_smt (paste it in the comment).
The LoadedStatisticsMicroseconds value the failing query reports (0? or non-zero but cached unexpectedly?).
Confirmation of whether the sync PR includes commit 78d10eb43cd and/or 801397d3c11.
Read access to the private report bucket, or a re-publish of the failing test output as a public gist.

If (1) shows the failure is on the same _assert_load line and _loaded_stats_us == 0, the most likely public-side suspect is the new retry contract in 78d10eb43cd not propagating through the SMT-side part override. We'd then fix it on master and request a re-sync.

PR ClickHouse#104228 added a NEON SIMD path to find_symbols.h. On AArch64 the inline dispatcher tail-calls an out-of-line ([[gnu::noinline]]) helper whenever end - begin >= 16; that helper ran an 8-byte scalar prefix and then the NEON loop. That guard is the wrong signal for format readers (CSV/TSV/JSON). They pass a haystack that ends at the far read-buffer end, but whose delimiter/quote is within the first few bytes of the current field. end - begin is large, so every short field pays a branch to out-of-line code where master used a tiny inline scalar loop. This is the regression reported in ClickHouse#106605 for the tsv_csv_nullable_parsing query 1 perf test on ARM (CSV Nullable(String), 1M short rows): about +18..30% on Neoverse-V2. Move the 8-byte scalar prefix out of the out-of-line helper and into the inline dispatcher, then tail-call the helper (now the NEON loop only) at begin + 8 (forward) / end - 8 (backward). Short fields whose delimiter is in the first 8 bytes now resolve fully inline with no SIMD call and no repeated work, while sparse long-haystack scans reach the NEON loop unchanged. The [[gnu::noinline]] attribute now guards only the SIMD loop, which is its intended purpose: keeping the vector loop out of short-string callers such as trim. Applied to all four AArch64 dispatch sites (find_first / find_last, compile-time and runtime needles). x86 paths are unchanged. Validated by cross-compiling for aarch64 (clang-21) and running the NEON code under qemu-aarch64: 2.4M randomized/exhaustive correctness checks against a scalar reference with 0 mismatches, and the existing gtest_find_symbols suite passes. Microbenchmarks behind a real call site show the short-field CSV case beats master at every delimiter offset 0..16 while long-haystack throughput is unchanged (1.13 GB/s, identical to master). Closes: ClickHouse#106605 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The trim_numbers performance test query 1 (ltrim(toString(number))) regressed ~12-14% on ARM/aarch64 after the NEON rewrite of find_first_not_symbols (PR ClickHouse#104228). Decimal strings have no leading whitespace, so default trimLeft is a no-op that still calls the per-row symbol scan and copies the whole column into a fresh ColumnString. Detect this case cheaply: default left-trim removes only leading spaces, so a row is unchanged iff its first byte is not a space. When no row would change, return the input column directly and skip the per-row scan and the full copy. This sidesteps the regressed code path entirely for the common case and avoids materializing an identical column. Other paths (trimRight, trimBoth, custom trim characters, FixedString) are unchanged. The mixed-column case (some row starts with a space) falls back to the normal path, so leading spaces are still removed. Closes: ClickHouse#106601 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The trim_numbers performance test query 1 (ltrim(toString(number))) regressed ~12-14% on ARM/aarch64 after the NEON rewrite of find_first_not_symbols (PR ClickHouse#104228). Decimal strings have no leading whitespace, so default trimLeft still paid the per-row SIMD scan setup for a no-op. Restructure FunctionTrim into three separate non-template paths (vectorSpace, vectorCustom, vectorFixed) instead of the template-with-if-constexpr machinery plus executeImpl. The default (space) path now advances begin/end with a scalar while loop comparing single bytes, which beats the SIMD scan for short strings and exits after one comparison when there is no leading space. Output capacity is reserved once and the size is set after the loop, removing the per-row resize. trim_left/trim_right are cheap runtime ifs in the loop body. Custom trim character sets keep O(1) table lookups; FixedString is handled by a single non-template function. Results are unchanged for all inputs (verified against the regex baseline and under ASan/UBSan). Closes: ClickHouse#106601 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 6, 2026

alexey-milovidov requested a review from Algunenano May 6, 2026 14:12

Algunenano self-assigned this May 6, 2026

alexey-milovidov marked this pull request as ready for review May 6, 2026 14:48

Algunenano approved these changes May 6, 2026

View reviewed changes

alexey-milovidov and others added 2 commits May 6, 2026 21:17

Algunenano reviewed May 7, 2026

View reviewed changes

Comment thread base/base/find_symbols.h

alexey-milovidov and others added 3 commits May 7, 2026 18:58

Merge remote-tracking branch 'origin/master' into find-symbols-neon

4b0309c

Merge remote-tracking branch 'origin/master' into find-symbols-neon

81ef64f

clickhouse-gh Bot reviewed May 9, 2026

View reviewed changes

Comment thread base/base/find_symbols.h Outdated

alexey-milovidov added 3 commits May 9, 2026 14:25

Merge remote-tracking branch 'origin/master' into find-symbols-neon

cc0f25b

Merge remote-tracking branch 'origin/master' into find-symbols-neon

b045a07

groeneai mentioned this pull request May 11, 2026

New functions isPrime and isProbablePrime #104234

Merged

1 task

alexey-milovidov and others added 2 commits May 11, 2026 09:20

Merge remote-tracking branch 'origin/master' into find-symbols-neon

0af056f

clickhouse-gh Bot reviewed May 12, 2026

View reviewed changes

Comment thread base/base/find_symbols.h

alexey-milovidov and others added 2 commits May 12, 2026 09:39

Algunenano approved these changes May 13, 2026

View reviewed changes

Merge remote-tracking branch 'origin/master' into find-symbols-neon

2a20976

groeneai mentioned this pull request May 13, 2026

Import silk library #104810

Merged

Merge remote-tracking branch 'origin/master' into find-symbols-neon

dadff21

clickhouse-gh Bot reviewed May 14, 2026

View reviewed changes

Comment thread base/base/find_symbols.h

alexey-milovidov and others added 4 commits May 14, 2026 06:14

Merge remote-tracking branch 'origin/master' into find-symbols-neon

54112e1

Merge remote-tracking branch 'origin/master' into find-symbols-neon

06afc7d

alexey-milovidov added this pull request to the merge queue May 14, 2026

Merged via the queue into master with commit c1ccce2 May 15, 2026
167 checks passed

alexey-milovidov deleted the find-symbols-neon branch May 15, 2026 00:08

robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 15, 2026

groeneai mentioned this pull request May 27, 2026

Log live local thread pools at GlobalThreadPool shutdown #105905

Merged

groeneai mentioned this pull request Jun 10, 2026

Fix SIMD NUL false match in find_first_symbols_or_null (extractKeyValuePairs LOGICAL_ERROR, STID 6488-5f6c) #103389

Closed

1 task

Search	Scalar (master)	NEON (this PR)	Speedup
`find_first_symbols<'\t'>`	2.78 GB/s	26.19 GB/s	9.4x
`find_first_symbols<4 chars>`	1.45 GB/s	15.09 GB/s	10.4x
`find_first_symbols<8 chars>`	1.28 GB/s	8.98 GB/s	7.0x
`find_first_symbols(runtime, 4 chars)`	1.07 GB/s	13.56 GB/s	12.7x
`find_first_symbols(runtime, 8 chars)`	0.62 GB/s	7.93 GB/s	12.8x

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

alexey-milovidov commented May 6, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Microbenchmark (function-level, 64MB buffer, lowercase ASCII)

Real-query performance — ClickBench hits shard (1M rows of real web data: URL, Title, SearchPhrase)

Real-query performance — synthetic 10M-row workloads (1-1.5GB on disk)

Note on the dense-JSONEachRow regression

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Version info

Uh oh!

clickhouse-gh Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Algunenano left a comment

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov commented May 6, 2026

Uh oh!

groeneai commented May 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented May 12, 2026

Uh oh!

groeneai commented May 13, 2026

Root cause: leaked local ThreadPool instances at server shutdown

Main thread (LWP 499292)

The thread it's waiting on (LWP 500233 "ThreadPool")

Scale of the leak (this run)

Why this matches the chronic 8-10/day rate

The absl::Mutex frames you saw

Direction for fix

Not pushing a fix yet

Uh oh!

alexey-milovidov commented May 13, 2026

Uh oh!

alexey-milovidov commented May 13, 2026

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 14, 2026

LLVM Coverage Report

Uh oh!

alexey-milovidov commented May 14, 2026

Uh oh!

Uh oh!

groeneai commented May 15, 2026

What we can see on the public side

What we can't see

The shared code path

What we need from you

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alexey-milovidov commented May 6, 2026 •

edited by robot-clickhouse

Loading

Real-query performance — ClickBench `hits` shard (1M rows of real web data: URL, Title, SearchPhrase)

Note on the dense-`JSONEachRow` regression

clickhouse-gh Bot commented May 6, 2026 •

edited

Loading

Root cause: leaked local `ThreadPool` instances at server shutdown

The `absl::Mutex` frames you saw