Speed up ZSTD decompression on AArch64 (small-offset overlapping copies) by alexey-milovidov · Pull Request #108049 · ClickHouse/ClickHouse · GitHub
Skip to content

Speed up ZSTD decompression on AArch64 (small-offset overlapping copies)#108049

Merged
thevar1able merged 4 commits into
masterfrom
optimize-zstd-decompression-aarch64
Jul 2, 2026
Merged

Speed up ZSTD decompression on AArch64 (small-offset overlapping copies)#108049
thevar1able merged 4 commits into
masterfrom
optimize-zstd-decompression-aarch64

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 21, 2026

Copy link
Copy Markdown
Member

Speeds up ZSTD decompression on AArch64 (ARM) for integer and low-cardinality columns.

For small match offsets (overlap period < 16 bytes), ZSTD_wildcopy falls back to an 8-byte-per-iteration COPY8 loop. Profiling ZSTD level 1 decompression of ClickBench hits data on Graviton 4 (Neoverse-V2) showed this path dominates integer and low-cardinality columns — around 75% of decompression time for ClientIP — because their matches frequently have offsets of 4 or 8 bytes (runs of repeated fixed-width values). String columns are entropy-decode-bound and are not affected.

This bumps the contrib/zstd submodule (ClickHouse fork) to build the repeating pattern once in a NEON register and store 16 bytes per iteration, advancing the pattern with a single vqtbl1q_u8 table lookup and no load inside the loop. The scalar COPY8 path is unchanged on other targets, and the 16-byte stores stay within the existing WILDCOPY_OVERLENGTH slack.

Decompression throughput (ZSTD level 1, real hits column bytes, Graviton 4):

column type baseline patched speedup
UserID Int64 2624 MB/s 6053 MB/s 2.30x
AdvEngineID Int16 5990 MB/s 8712 MB/s 1.46x
ClientIP Int32 1857 MB/s 2607 MB/s 1.42x
RegionID Int32 2468 MB/s 3127 MB/s 1.27x
strings / incompressible ~1.00x

End-to-end on a 20M-row ZSTD hits table (single thread): a UserID scan is +50%, a 77-integer-column scan is +6.8%. Verified byte-exact and clean under AddressSanitizer and UndefinedBehaviorSanitizer; query results are identical before and after.

The contrib/zstd change: ClickHouse/zstd#3

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Sped up ZSTD decompression on AArch64 (ARM) for columns with small match offsets, such as fixed-width integer columns, by vectorizing short-offset overlapping copies with NEON. For example, decompression of a UInt64 column is up to ~2.3x faster on AWS Graviton 4.

Version info

  • Merged into: 26.7.1.453

Bump the `contrib/zstd` submodule to include a NEON `vqtbl1q_u8` pattern-fill
for small-offset overlapping copies in `ZSTD_wildcopy`. This replaces the
8-byte-per-iteration `COPY8` loop (16 bytes per iteration, no load in the loop)
that otherwise dominates ZSTD decompression of integer and low-cardinality
columns on AArch64.

On Graviton 4 (Neoverse-V2), ZSTD level 1 decompression of ClickBench `hits`
columns speeds up: UserID 2.30x, AdvEngineID 1.46x, ClientIP 1.42x,
RegionID 1.27x; string and incompressible columns are unchanged. End-to-end on
a 20M-row ZSTD `hits` table (single thread), a `UserID` scan is +50% and a
77-integer-column scan is +6.8%. Verified byte-exact and clean under ASan/UBSan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [a064fd4]

Summary:


AI Review

Summary

This PR bumps contrib/zstd to the ClickHouse zstd clickhouse branch merge commit that vectorizes the short-offset ZSTD_wildcopy path on AArch64. I found no unresolved correctness, safety, metadata, or performance-evidence issues requiring inline findings.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added pr-performance Pull request with some performance improvements submodule changed At least one submodule changed in this PR. labels Jun 21, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, provide a fix for test_storage_iceberg_concurrent.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov fix in #108075.

Root cause: test_concurrent_reads runs 25 concurrent SELECT * per round, each fanning out to ~max_threads (core count) read threads, so the readers oversubscribe the global thread pool under load. On the asan/db-disk/old-analyzer 5/6 shard one SELECT is starved past the 300s client receive_timeout (the failure here and on #100651 and master); locally it's an outright CANNOT_SCHEDULE_TASK. Pinning max_threads=4 on the SELECT bounds the demand without weakening the test — all 25 readers still race the 15 writers and the row-count assertion is unchanged. Verified both directions on a debug build.

pull Bot pushed a commit to Haofei/ClickHouse that referenced this pull request Jun 22, 2026
test_concurrent_reads runs 25 concurrent SELECT * queries against one
accumulating Iceberg table while 15 Spark writers commit in parallel, for
10 rounds. By default each SELECT fans out to ~max_threads (number of CPU
cores) internal read threads, so 25 simultaneous readers request hundreds
of threads at once. On a loaded CI shard this oversubscribes the global
thread pool: a single SELECT either gets starved past the 300s client
receive_timeout (Return code 159, seen on the asan/db-disk/old-analyzer 5/6
shard for PRs ClickHouse#100651 and ClickHouse#108049 and on master) or the server rejects the
query outright with CANNOT_SCHEDULE_TASK (Code 439, threads=1000, jobs=1000).

Pin max_threads=4 on the concurrent SELECT so the readers' aggregate thread
demand stays bounded. This does not change what the test verifies: all 25
readers still race all 15 writers across every round and the exact row-count
assertion is unchanged. Only the per-query intra-read parallelism is capped,
which is irrelevant to concurrent-read snapshot consistency.

Locally reproduced both directions on a debug build: pristine test fails at
round 2 with CANNOT_SCHEDULE_TASK; with the cap it passes all 10 rounds
(~150s).
@Algunenano Algunenano self-assigned this Jun 22, 2026
@thevar1able thevar1able self-assigned this Jun 22, 2026
@thevar1able

Copy link
Copy Markdown
Member

FYI ClickHouse/zstd#4

@Algunenano

Copy link
Copy Markdown
Member

@thevar1able Let's bring it here too please and I'll review the 2 together. We should also consider sending the patches upstream once we are ok with them

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The only reason I skipped x86 is that the current testing machine is AArch64.
So, for AArch64, it is tested quite thoroughly; for x86, we don't have it.

Please do not merge x86 optimization here - better to do it in steps for more confidence.
Also, for x86, we might want to have dynamic dispatch.

@Algunenano

Copy link
Copy Markdown
Member

The only reason I skipped x86 is that the current testing machine is AArch64.
So, for AArch64, it is tested quite thoroughly; for x86, we don't have it.

Performance PRs are tested both for x86 and ARM

Also, for x86, we might want to have dynamic dispatch.

Our baseline is v3, which is what that PR is introducing. We'd only need dynamic dispatch if there was a v4 method

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Performance tests don't use zstd for now, so they are pointless here.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Anyway, this depends on #108093

@Algunenano Algunenano left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left comments in the PR themselves. They look good and resemble the code that we have (or decided not to have) for our LZ4 decompressor.

We should measure other columns as we do for LZ4 to verify there are no regressions and merge. And send it upstream

@clickhouse-gh

clickhouse-gh Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

🟢 AI verdict: improvement3 query(s) improved out of 38 analysed

This PR bumps the zstd library and nothing else. Three scan-heavy ClickBench queries improved in lockstep (Q4 -20.6%, Q15 -18.8%, Q28 -5.7%), all confirmed by both tests over large sample sizes. A uniform, same-direction speedup across independent scans is the expected signature of faster decompression lowering CPU cost per block read. No regressions were flagged.

clickbench

🟢 3 improved

Flagged queries (3 of 43)
Query Verdict Baseline median (ms) PR median (ms) Change q-value Hint
🟢 4 improvement 262 208 -20.6% <0.0001 cpu: zstd contrib bump; faster decompression plausibly cuts per-block scan cost. -20.6%, consistent across both tests.
🟢 15 improvement 245 199 -18.8% <0.0001 cpu: Scan-heavy query; -18.8% fits a faster zstd decompression path. Large, consistent shift.
🟢 28 improvement 6671 6292 -5.7% <0.0001 cpu: Long scan query; modest -5.7% gain consistent with faster zstd decompression. Confirmed by both tests.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: 9ec15c76-a77e-4dc0-b50a-816d93127ae7
  • MIRAI run: 4bb6db01-7bc4-4409-ad0b-eeefe2b9837a
  • PR check IDs:
    • clickbench_511049_1782836784
    • clickbench_511055_1782836784
    • clickbench_511062_1782836784
    • tpch_adapted_1_official_511072_1782836784
    • tpch_adapted_1_official_511084_1782836784
    • tpch_adapted_1_official_511096_1782836784

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov This is the streaming DISTINCT-in-order logical error (Equal values are not contiguous within the range assumed to be sorted), already fixed by #108568.

Failing query: SELECT count() ... FROM (SELECT DISTINCT * FROM t_streaming_test STREAM LIMIT 10). Stack: DistinctSortedStreamTransform::transform -> ColumnString::getEqualRangeEndAssumeSorted -> the checkEqualRangeEndAssumeSorted assertion in findEqualRangeEndAssumeSorted.h (added by #106502).

Root cause: a STREAM read returns parts in commit order, not sorting-key order, so its output is not globally sorted by the sorting key. checkSupportedReadingStep accepted the streaming ReadFromMergeTree and requested read-in-order, advertising a sorting-key order that does not hold and feeding unsorted data to DISTINCT-in-order. The assertion then aborts the server in debug/sanitizer builds (handled normally in release).

#108568 ("Do not apply read-in-order optimizations to STREAM reads") guards exactly this path with if (reading->getQueryInfo().isStream()) return nullptr;, and its regression test covers the identical SELECT DISTINCT * ... STREAM LIMIT shape. On review now.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

We need more data to prove this optimization. In the worst case, we will have to make it adaptive, like our LZ4_decompress_faster.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov Agreed. No-regression sweep plan, plus the adaptive fallback design.

No-regression column sweep (Graviton 4, ZSTD level 1)

Methodology mirrors src/Compression/examples/decompress_perf.cpp: real hits column bytes, each column compressed at level 1, min of N runs, baseline submodule vs patched, identical binary otherwise, MB/s per column. The PR table has 4 integer columns; the matrix expands to every type class, and the entropy-bound and incompressible columns are the ones that must stay >= 1.00x:

  • Integers, all widths: Int8/16/32/64, UInt* (CounterID, UserID, WatchID, AdvEngineID, RegionID, ClientIP, ResolutionWidth, ...)
  • LowCardinality: the categorical hits columns (dictionary index streams)
  • Date / DateTime: EventDate, EventTime
  • Float: where present
  • Strings, entropy-decode-bound: URL, Title, Referer, SearchPhrase
  • Incompressible: WatchID-like high-entropy columns

Pass criterion: no class below 1.00x within run-to-run noise. Strings and incompressible should be neutral since they exercise the non-overlap path and the FSE/Huffman decoder, untouched here.

I cannot run this on the fleet (x86, where the NEON path compiles out to COPY8, and x86 is the separate out-of-scope zstd PR #4). It needs the AArch64 host. Do you want me to drive it there, or will you run it on the same machine as the original table?

Correctness (done, arch-independent)

I verified the init/adv tables are byte-exact against the scalar src[k % diff] reference by simulating vqtbl1q_u8 in plain C: 125.8M output bytes across every overlap period diff=1..15 and length 1..4096, zero mismatches. So this is purely a throughput question.

Adaptive fallback, if the sweep shows a regression

Structural difference from LZ4: LZ4::decompress(..., PerformanceStatistics&) is called per block from the codec and the bandit times each block to select among decompressImpl<8/16/32>. ZSTD goes through the opaque ZSTD_decompress, and the overlap copy is ZSTD_wildcopy called per match inside the sequence decoder, so a block-level bandit is not a drop-in. Two options:

  1. In-wildcopy dispatch on (diff, length), cheap, no statistics. The short-offset branch already holds both. Gate NEON behind a length threshold so the fixed setup (two vld1q plus vqtbl1q to build the pattern) never loses to COPY8 on very short runs, falling back to scalar there. This is the direct analogue of LZ4 picking copy granularity by data characteristics, and it targets exactly the regime where 16-byte tbl can regress.
  2. Full LZ4-parity bandit at the CompressionCodecZSTD layer: expose a wildcopy-strategy selector in the contrib fork and thread a PerformanceStatistics accumulator through the codec. Heavier (contrib API plus codec changes), only if (1) is insufficient.

Precedent from our LZ4 NEON code: single vtbl1_u8 (8-byte) is a 6% Graviton win, but vtbl2_u8 at 16/32-byte was slower than scalar so we kept scalar there. This PR uses single-call vqtbl1q_u8 (16-byte, no in-loop load), which sidesteps the multiple-call cost that hurt LZ4, so it is plausibly a genuine 16-byte win. That contrast is also why the per-column sweep matters: 16-byte NEON tbl has been a coin-flip in this codebase.

@thevar1able

Copy link
Copy Markdown
Member

Merged ClickHouse/zstd#3. We need to point this PR to clickhouse branch in zstd repo, maybe @groeneai can handle this?

@groeneai

Copy link
Copy Markdown
Contributor

@thevar1able The clickhouse branch HEAD is now bde4f839008a59c3689493a0903c7251988cf41d (the zstd#3 merge commit), so that is what contrib/zstd should point at.

It is a pure pointer move, zero content change: the tree of bde4f839 (b2b1f684...) is byte-identical to the current submodule 1ced1f24, and 1ced1f24 is the merge's second parent. The gitlink just re-targets the permanent clickhouse branch instead of the now-merged PR-3 feature branch, so no rebuild risk.

One line on optimize-zstd-decompression-aarch64:

git -C contrib/zstd fetch origin clickhouse && \
git -C contrib/zstd checkout bde4f839008a59c3689493a0903c7251988cf41d && \
git add contrib/zstd && \
git commit -m "Point contrib/zstd at clickhouse branch (zstd#3 merged)" && git push

I do not have push access to this branch (it is on ClickHouse/ClickHouse, not a fork I control), so I cannot push it myself. Happy to open a PR against the branch with this bump if that is easier.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Continue-PR status — CI triage (all reds are unrelated flakes)

The PR itself is ready: AI Review ✅ Approve, @Algunenano APPROVED, 0 unresolved review threads, and the latest Cloud Performance Report is a 🟢 improvement (Q4 −20.6%, Q15 −18.8%, Q28 −5.7%, no regressions flagged) — which answers the earlier "we need more data / measure other columns" concern. The contrib/zstd pointer is already correctly moved to the clickhouse-branch HEAD bde4f839 (commit "Update submodule pointer"), as @groeneai requested; its tree is byte-identical to the previously-pointed 1ced1f24.

This is a pure contrib/zstd submodule bump, so none of the CI reds can be caused by it. Triage (CIDB, last 10 days):

  1. Stateless arm_asan_ubsan, azure, sequential — 4 workload/scheduler tests failed together: 03821_drop_workload_during_query_race, 03562_cpu_scheduling_with_one_thread, 03232_workloads_and_resources, 03232_workload_create_and_drop. 0 failures on master; the identical 4-test cluster failed on the same config on unrelated PR Allow EXISTS on a dictionary with the SHOW DICTIONARIES privilege #108084. Root cause is global-workload-state cross-test leakage: 03821 creates the root WORKLOAD wl_03821 and runs a DROP/CREATE race loop with a best-effort EXIT cleanup (DROP WORKLOAD IF EXISTS wl_03821 2>/dev/null ||:); when that cleanup doesn't fully take, wl_03821 survives in the shared sequential server, so a later create workload all in 03232_workloads_and_resources fails with The second root is not allowed. You should probably add 'PARENT wl_03821' clause. — exactly the observed error. A ZSTD decompression speedup cannot produce a workload-graph DDL error.

    @groeneai, this is an unrelated CI flake with no tracking issue — could you investigate and provide a fix in a separate PR? Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108049&sha=a064fd421a09493c119fc473ef63c94064a6ca75&name_0=PR&name_1=Stateless%20tests%20%28arm_asan_ubsan%2C%20azure%2C%20sequential%29 (the same cluster is reproducible on Allow EXISTS on a dictionary with the SHOW DICTIONARIES privilege #108084).

  2. Stress test arm_msan — "Hung check failed, possible deadlock found" — known OPEN issue Hung check failed, possible deadlock found #107941 (linked in the CI report itself); 1150 hits across 442 PRs (210 on master) in 10 days. Unrelated.

  3. Upgrade check amd_release — "Error message in clickhouse-server.log" — a leftover test mutation retrying if(1, CAST(plus(x, throwIf(1)),'UInt64'), x) logs FUNCTION_THROW_IF_VALUE_IS_NON_ZERO, which the log scanner flags. 89 hits across 81 PRs in 10 days — broad upgrade-check log noise, not related to this PR.

I re-ran the failed jobs (workflow run 28447611330). Not re-merging master: merge-base is from today (dd89f87, 01:22Z), and none of the reds have a master-side fix that a re-merge would pick up, so a re-merge would only churn CI. The PR is ready to merge once CI clears.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov done — fix is in a separate PR: #108978.

Confirmed your root cause, with one refinement: the leak isn't a partial cleanup, it's that the EXIT-trap cleanup never runs. On timeout clickhouse-test sends SIGTERM then SIGKILL 0.1s later; 03821's trap cleanup EXIT doesn't fire on SIGTERM and 0.1s is too short, so when 03821 times out the root wl_03821 + the (WORKER THREAD, MASTER THREAD) rs_03821 leak on the shared sequential server. CIDB shows exactly that ordering on this PR and on #108084: 03821 fails with Timeout! Processes left in process group, then 03232_workloads_and_resources / 03232_workload_create_and_drop (second root) and 03562_cpu_scheduling_with_one_thread (second WORKER THREAD resource) fail together, return code 36, 0 on master.

Fix: bound every client call in 03821 with timeout so a query/DDL that wedges under the drop/create race can't block wait until the 180s cap (where SIGKILL skips cleanup). The test then finishes normally and the EXIT trap always drops the workload + resource. Verified locally: a leaked wl_03821/rs_03821 deterministically reproduces both sibling errors; patched test passes 50/50 under randomization with zero leftover state.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Continue-PR status — CI triage (only red is an unrelated stress flake)

The PR remains ready to merge:

  • AI Review ✅ Approve, @Algunenano APPROVED, 0 unresolved review threads.
  • Latest Cloud Performance Report is a 🟢 improvement (Q4 −20.6%, Q15 −18.8%, Q28 −5.7%, no regressions flagged) — answering the earlier "we need more data / measure other columns" concern.
  • The contrib/zstd pointer is correctly on the clickhouse-branch HEAD bde4f839 (commit "Update submodule pointer" a064fd42); zero content change vs. the previous 1ced1f24.

Only CI red on a064fd42: Stress test (arm_msan)Logical error: Bad cast from type A to B (STID: 1499-32c3). This is an unrelated, broadly-flaky stress failure of the well-known testing,fuzz "Bad cast from type A to B" class (sibling open issues #108931 / #107598 / #103814 for other STIDs of the same class). CIDB for this exact STID over the last 10 days shows 4 occurrences across 4 distinct PRs and 3 different stress configs (amd_tsan, arm_msan, arm_debug), all on master / base_ref=master (06-22, 06-29, 06-30) — i.e. it reproduces independently of any PR. A pure contrib/zstd submodule bump (NEON vectorization of the short-offset decompress wildcopy) cannot produce a type-cast assertion deep in the pipeline executor, so this is not caused by this PR. Reran the failed job (run 28447611330 --failed).

The workload/scheduler flake cluster flagged in the previous pass is now fixed in @groeneai's #108978 — thanks.

No re-merge: merge-base dd89f87 is from today (06-30 01:22Z, < 1 day), and no master commit since then touches contrib/zstd or src/Compression, so a re-merge would pull in nothing relevant and would only re-churn CI on a master-wide flake.

Blocked only on the stress rerun going green and a maintainer merge.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@thevar1able thevar1able added this pull request to the merge queue Jul 2, 2026
Merged via the queue into master with commit 97308a0 Jul 2, 2026
510 of 514 checks passed
@thevar1able thevar1able deleted the optimize-zstd-decompression-aarch64 branch July 2, 2026 20:20
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants