Speed up ZSTD decompression on AArch64 (small-offset overlapping copies)#108049
Conversation
Bump the `contrib/zstd` submodule to include a NEON `vqtbl1q_u8` pattern-fill for small-offset overlapping copies in `ZSTD_wildcopy`. This replaces the 8-byte-per-iteration `COPY8` loop (16 bytes per iteration, no load in the loop) that otherwise dominates ZSTD decompression of integer and low-cardinality columns on AArch64. On Graviton 4 (Neoverse-V2), ZSTD level 1 decompression of ClickBench `hits` columns speeds up: UserID 2.30x, AdvEngineID 1.46x, ClientIP 1.42x, RegionID 1.27x; string and incompressible columns are unchanged. End-to-end on a 20M-row ZSTD `hits` table (single thread), a `UserID` scan is +50% and a 77-integer-column scan is +6.8%. Verified byte-exact and clean under ASan/UBSan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [a064fd4] Summary: ✅
AI ReviewSummaryThis PR bumps Final VerdictStatus: ✅ Approve |
|
@groeneai, provide a fix for |
|
@alexey-milovidov fix in #108075. Root cause: |
test_concurrent_reads runs 25 concurrent SELECT * queries against one accumulating Iceberg table while 15 Spark writers commit in parallel, for 10 rounds. By default each SELECT fans out to ~max_threads (number of CPU cores) internal read threads, so 25 simultaneous readers request hundreds of threads at once. On a loaded CI shard this oversubscribes the global thread pool: a single SELECT either gets starved past the 300s client receive_timeout (Return code 159, seen on the asan/db-disk/old-analyzer 5/6 shard for PRs ClickHouse#100651 and ClickHouse#108049 and on master) or the server rejects the query outright with CANNOT_SCHEDULE_TASK (Code 439, threads=1000, jobs=1000). Pin max_threads=4 on the concurrent SELECT so the readers' aggregate thread demand stays bounded. This does not change what the test verifies: all 25 readers still race all 15 writers across every round and the exact row-count assertion is unchanged. Only the per-query intra-read parallelism is capped, which is irrelevant to concurrent-read snapshot consistency. Locally reproduced both directions on a debug build: pristine test fails at round 2 with CANNOT_SCHEDULE_TASK; with the cap it passes all 10 rounds (~150s).
|
@thevar1able Let's bring it here too please and I'll review the 2 together. We should also consider sending the patches upstream once we are ok with them |
|
The only reason I skipped x86 is that the current testing machine is AArch64. Please do not merge x86 optimization here - better to do it in steps for more confidence. |
Performance PRs are tested both for x86 and ARM
Our baseline is v3, which is what that PR is introducing. We'd only need dynamic dispatch if there was a v4 method |
|
Performance tests don't use zstd for now, so they are pointless here. |
|
Anyway, this depends on #108093 |
Algunenano
left a comment
There was a problem hiding this comment.
I left comments in the PR themselves. They look good and resemble the code that we have (or decided not to have) for our LZ4 decompressor.
We should measure other columns as we do for LZ4 to verify there are no regressions and merge. And send it upstream
|
📊 Cloud Performance Report 🟢 AI verdict: This PR bumps the zstd library and nothing else. Three scan-heavy ClickBench queries improved in lockstep (Q4 -20.6%, Q15 -18.8%, Q28 -5.7%), all confirmed by both tests over large sample sizes. A uniform, same-direction speedup across independent scans is the expected signature of faster decompression lowering CPU cost per block read. No regressions were flagged. clickbench🟢 3 improved Flagged queries (3 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🟢 No significant changes Debug info
|
|
@alexey-milovidov This is the streaming DISTINCT-in-order logical error ( Failing query: Root cause: a STREAM read returns parts in commit order, not sorting-key order, so its output is not globally sorted by the sorting key. #108568 ("Do not apply read-in-order optimizations to STREAM reads") guards exactly this path with |
|
We need more data to prove this optimization. In the worst case, we will have to make it adaptive, like our LZ4_decompress_faster. |
|
@alexey-milovidov Agreed. No-regression sweep plan, plus the adaptive fallback design. No-regression column sweep (Graviton 4, ZSTD level 1) Methodology mirrors
Pass criterion: no class below 1.00x within run-to-run noise. Strings and incompressible should be neutral since they exercise the non-overlap path and the FSE/Huffman decoder, untouched here. I cannot run this on the fleet (x86, where the NEON path compiles out to COPY8, and x86 is the separate out-of-scope zstd PR #4). It needs the AArch64 host. Do you want me to drive it there, or will you run it on the same machine as the original table? Correctness (done, arch-independent) I verified the Adaptive fallback, if the sweep shows a regression Structural difference from LZ4:
Precedent from our LZ4 NEON code: single |
|
Merged ClickHouse/zstd#3. We need to point this PR to |
|
@thevar1able The It is a pure pointer move, zero content change: the tree of One line on I do not have push access to this branch (it is on ClickHouse/ClickHouse, not a fork I control), so I cannot push it myself. Happy to open a PR against the branch with this bump if that is easier. |
Continue-PR status — CI triage (all reds are unrelated flakes)The PR itself is ready: AI Review ✅ Approve, @Algunenano APPROVED, 0 unresolved review threads, and the latest Cloud Performance Report is a 🟢 improvement (Q4 −20.6%, Q15 −18.8%, Q28 −5.7%, no regressions flagged) — which answers the earlier "we need more data / measure other columns" concern. The This is a pure
I re-ran the failed jobs (workflow run |
|
@alexey-milovidov done — fix is in a separate PR: #108978. Confirmed your root cause, with one refinement: the leak isn't a partial cleanup, it's that the EXIT-trap cleanup never runs. On timeout Fix: bound every client call in |
Continue-PR status — CI triage (only red is an unrelated stress flake)The PR remains ready to merge:
Only CI red on The workload/scheduler flake cluster flagged in the previous pass is now fixed in @groeneai's #108978 — thanks. No re-merge: merge-base Blocked only on the stress rerun going green and a maintainer merge. |

Speeds up ZSTD decompression on AArch64 (ARM) for integer and low-cardinality columns.
For small match offsets (overlap period < 16 bytes),
ZSTD_wildcopyfalls back to an 8-byte-per-iterationCOPY8loop. Profiling ZSTD level 1 decompression of ClickBenchhitsdata on Graviton 4 (Neoverse-V2) showed this path dominates integer and low-cardinality columns — around 75% of decompression time forClientIP— because their matches frequently have offsets of 4 or 8 bytes (runs of repeated fixed-width values). String columns are entropy-decode-bound and are not affected.This bumps the
contrib/zstdsubmodule (ClickHouse fork) to build the repeating pattern once in a NEON register and store 16 bytes per iteration, advancing the pattern with a singlevqtbl1q_u8table lookup and no load inside the loop. The scalarCOPY8path is unchanged on other targets, and the 16-byte stores stay within the existingWILDCOPY_OVERLENGTHslack.Decompression throughput (ZSTD level 1, real
hitscolumn bytes, Graviton 4):UserIDAdvEngineIDClientIPRegionIDEnd-to-end on a 20M-row ZSTD
hitstable (single thread): aUserIDscan is +50%, a 77-integer-column scan is +6.8%. Verified byte-exact and clean under AddressSanitizer and UndefinedBehaviorSanitizer; query results are identical before and after.The
contrib/zstdchange: ClickHouse/zstd#3Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Sped up ZSTD decompression on AArch64 (ARM) for columns with small match offsets, such as fixed-width integer columns, by vectorizing short-offset overlapping copies with NEON. For example, decompression of a
UInt64column is up to ~2.3x faster on AWS Graviton 4.Version info
26.7.1.453