iframe-proxy

alexey-milovidov · 2026-05-09T17:59:43Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

For the prealloc_serialized family of aggregation methods, precompute per-row hashes during the batch-serialization pass and use them to (a) skip rehashing in emplaceKey/findKey and (b) software-prefetch the next row's bucket. Speeds up multi-key string/serialized aggregation by hiding hash-table cache-miss latency.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Why

When HashMethodSerialized<prealloc=true> is chosen for aggregation, the row sizes are already collected up front so keys can be serialized in a single batch. The hash for each row is computed twice in this codepath — once implicitly while the hash table looks up the cell, then again every time we touch that row. We can compute it once during the batch pass and reuse it.

That precomputed hash also lets us prefetch the future row's bucket: by the time the loop reaches it, the cache line is warm.

What changed

Hash methods opt in via static constexpr bool has_pre_computed_hashes = true and provide hashes (a WeakHash32), prefetching (PrefetchingHelper), and prefetch_look_ahead.
HashMethodSerialized<prealloc=true> opts in. HashMethodSingleLowCardinalityColumn forwards the flag from its base. All other hash methods default to false via HashMethodBase.
ColumnsHashingImpl::emplaceKey/findKey: when has_pre_computed_hashes, prefetch the bucket look_ahead rows ahead and skip the rehash on the current row. For findKey, also test data.isEmptyCell(hash_value) to early-exit a miss.
A new emplaceImpl<bool compute_hash> overload chooses between data.emplace(...) and data.emplace(..., hash_value).
HashTable exposes prefetchByHash(hash) and isEmptyCell(hash) publicly; TwoLevelHashTable adds matching overloads that route to the right bucket.

Origin

Split out from WIP PR #81944. The original commit also tweaked the PrefetchingHelper::calcPrefetchLookAhead algorithm; that piece is left out here because master already evolved a different (integer-arithmetic) version of the same calibration and the precomputed-hash path works fine on top of it.

Version info

Merged into: 26.5.1.649

For the `prealloc_serialized` family of aggregation methods, the row sizes are already collected upfront so the keys can be serialized in a single batch pass. This commit takes advantage of that to also compute the per-row hashes ahead of time (`WeakHash32` from each key column), which then unlocks two speedups in the main aggregation loop: 1. **Skip hashing in `emplaceKey`/`findKey`.** The hash is already known, so the call goes through `data.emplace(key_holder, it, inserted, hash_value)` instead of recomputing. A new `emplaceImpl` overload takes the precomputed hash via a `compute_hash` template parameter. 2. **Software prefetch.** Each iteration prefetches the cache line for a future row's bucket using `data.prefetchByHash(hashes[row + look_ahead])`, hiding cache-miss latency on the hash table. The look-ahead distance is calibrated dynamically by `PrefetchingHelper`. For `findKey`, the prefetched cell is also tested via the new `data.isEmptyCell(hash_value)` (works on both `HashTable` and `TwoLevelHashTable`) so missed lookups exit before constructing the key holder. Hash methods opt in by declaring `static constexpr bool has_pre_computed_hashes = true` and providing `hashes`, `prefetching`, `prefetch_look_ahead` members. `HashMethodSerialized<prealloc=true>` opts in here; all others default to `false` via `HashMethodBase` and keep the existing path. `HashMethodSingleLowCardinalityColumn` forwards the flag from its underlying base. Originally part of WIP PR #81944.

clickhouse-gh · 2026-05-09T18:00:16Z

The prealloc serialized aggregation path was passing `WeakHash32` values (computed over source columns) to `data.emplace(..., hash_value)`, `data.prefetchByHash`, and `data.isEmptyCell`. Hash tables for serialized aggregation use `DefaultHash<std::string_view>` or `StringViewHash64` over the *serialized bytes*, so the hash mismatch placed entries in the wrong bucket and let `findKey` short-circuit when the key actually existed. Concretely, for `LowCardinality` columns the source `WeakHash32` is taken over dictionary indexes, so two rows with identical string values but different positions/indexes produced different hashes and never merged in the hash table. That broke `test_string_aggregation_compatibility` and caused wrong results across multi-key string/serialized aggregation. The fix: - Replace `WeakHash32 hashes` with `PaddedPODArray<size_t> precomputed_hashes` holding canonical per-row hashes from `Data::hash(serialized_keys[i])`. - Defer hash precomputation to the first `emplaceKey`/`findKey` call, where `Data` (and therefore its hash function) is known. - Only enable the precomputation when `use_batch_serialize` is true; otherwise fall back to the regular `data.prefetch(key_holder)` path. Verified locally that GROUP BY over `(UInt64, LowCardinality(String))`, multi-key `String` aggregation, and the `findKey` overflow path all return the expected number of groups. Addresses the AI review on #104475 and the `test_string_aggregation_compatibility` regressions reported in CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104475&sha=latest&name_0=PR Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fetch-prealloc

…sHashingImpl.h` `ColumnsHashingImpl.h` is a high-fanout header in `src/Common`, and the include of `<Interpreters/AggregationCommon.h>` added by the precomputed-hash change did not introduce any usage of symbols from that header — the prefetch code only needs `PrefetchingHelper` from `<Common/HashTable/Prefetching.h>`. Drop the include to avoid unnecessary transitive compile-time cost and the layering leakage of pulling an interpreter-layer header into common hashing code. Addresses review feedback on #104475 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fetch-prealloc

alexey-milovidov · 2026-05-11T17:54:20Z

I think this is great:

Tests that ran successfully (`NOT_FAILED` status) in the ParallelReplicas CI shard are no longer broken under parallel replicas and can be enabled by removing them from `tests/parallel_replicas_blacklist.txt`. Removed 67 test names and the comment annotations that referred specifically to those tests. Source: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104475&sha=latest&name_0=PR PR: ClickHouse#104475 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fetch-prealloc

…prefetch` for prealloc serialized The precomputed-hash path added for `HashMethodSerialized<prealloc=true>` turned on `data.prefetchByHash` based purely on `use_batch_serialize`, so `enable_software_prefetch_in_aggregation = 0` no longer disabled software prefetch for `prealloc_serialized`, and the path also bypassed the existing `min_bytes_for_prefetch` small-hashtable guard used by `Aggregator::executeImpl`. Thread both controls through `HashMethodContextSettings`: - `enable_prefetch` (set from `params.enable_prefetch`) gates whether `HashMethodSerialized` enters the precomputed-hash + prefetch path at all. - `min_bytes_for_prefetch` (set from `Aggregator::min_bytes_for_prefetch`) is checked lazily on the first `emplaceKey`/`findKey` once `Data` is known; if the hash table's buffer is below the threshold, fall back to the regular emplace path for the remainder of the batch.

…fetch-prealloc

alexey-milovidov · 2026-05-13T08:02:17Z

@groeneai, the CI failures on this PR look unrelated to the changes (which only touch hash table / aggregation code in src/Common/ColumnsHashing*, src/Common/HashTable/*, and src/Interpreters/Aggregator.cpp):

Stateless test 04221_dictgetordefault_arrayjoin_80703 — BAD_ARGUMENTS in ExternalDictionariesLoader::resolveDictionaryName. Stack trace is in dictionary resolution and analyzer code. Report.
Unit test FunctionsStress (tsan, function_prop_fuzzer) — Logical error: 'Function writeSlice expects same column types for GenericValueSlice and GenericArraySink.'. This is in array function code. Report.
AST fuzzer (amd_debug) — Inconsistent AST formatting between 'ExpressionList' and 'ExpressionList' involving extra parentheses around tuple(...) / not(...). Looks like the same class of issue as More inconsistent AST formatting #104605. Report.
AST fuzzer (arm_asan_ubsan) — Invalid number of rows in Chunk in JoiningTransform::transform. Report.

Please investigate these and provide fixes in separate PRs. If any are already in progress, please link them here. I just merged master into the branch, so CI will rerun shortly — if the failures are flaky they may not reproduce.

groeneai · 2026-05-13T08:51:35Z

@alexey-milovidov — investigated all four. All four are pre-existing chronic/trunk bugs unrelated to this PR (and two already have fix PRs in flight).

F1 — `04221_dictgetordefault_arrayjoin_80703` (BAD_ARGUMENTS)

Chronic-flaky on Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel) only. First seen 2026-05-12 (same day the test was added in 23d8c1f "Add test for #80703"). CIDB 30-day count: 21 failures across 18 PRs + 4 master hits, ALL on that one check. The error is Dictionary (tags_80703) not found during parallel-replica execution — the dictionary is created and immediately queried, and remote replicas don't see it in the per-query database scope.

Already in flight: PR #104778 (you, 2026-05-13 02:55 UTC) — Tag 04221_dictgetordefault_arrayjoin_80703 as no-parallel-replicas. MERGEABLE, +3 lines. Will resolve it once merged. No separate fix needed from us.

F2 — `FunctionsStress` (tsan, function_prop_fuzzer) — `writeSlice expects same column types for GenericValueSlice and GenericArraySink`

Rare chronic. 30-day CIDB: 3 hits total (this PR, PR #96886 on 2026-05-11, and one master hit 2026-04-17 in BuzzHouse (experimental, serverfuzz, amd_msan) with the GenericArraySlice/Sink variant). 90-day: ~11 hits, all on function_prop_fuzzer or BuzzHouse jobs.

This is a pre-existing limitation of FunctionIf::executeGenericArray (src/Functions/if.cpp:637 → GatherUtils::conditional → Algorithms.h:470 → writeSlice at Algorithms.h:83). The function_prop_fuzzer constructs if(cond, value_of_TypeA, array_of_TypeB) argument shapes that bypass the normal type-unification done by the analyzer, hitting the Same column types assertion. We previously fixed one variant in #103084 (QBit structureEquals); the residual GenericValueSlice case is a different code path (scalar-vs-array, not column-equality). PR #104352 ("Improve function_prop_fuzzer", alexbakharew) was merged recently and likely contributed to its slightly higher visibility, but the underlying engine assertion has been there since at least Feb 2026.

Not actionable on this PR's timeline; needs proper handling in executeGenericArray to wrap mismatched-type "then"/"else" branches before calling conditional. I'll keep tracking via CIDB.

F3 — AST fuzzer (amd_debug) `Inconsistent AST formatting between 'ExpressionList' and 'ExpressionList'` (tuple/not parens)

Confirmed STID 1941-1bfa, "NOT-tuple subvariant" — same class as #104605 as you noted. Already tracked on our side; root cause and minimal reproducer found yesterday on your directive on PR #104591:

Minimal reproducer (debug build, exit 134):

CREATE TABLE t (c0 Int CODEC(not((not(materialize(1), materialize(2)))), ZSTD)) ENGINE=Memory

The CODEC arg parser produces Function_not(is_operator=true, 2 args) directly, while the SELECT/RoundBrackets path wraps not(a, b) into Function_not(1 arg) → Function_tuple(is_operator=true, 2 args). In CODEC context the formatter has frame.allow_operators=false, which blocks Function_tuple's operator-form (a, b) (ASTFunction.cpp:343), so it falls back to tuple(a, b). On re-parse, that re-parsed tuple(...) gets parenthesized=true from the surrounding (...), and IAST::decideParensEmission adds one extra (...) on the next round-trip → mismatch at executeQuery.cpp:1330.

PR #104626 (already open from us) only covers the lambda-after-comma subvariant of #104605. The NOT/tuple subvariant needs a separate small fix targeting either decideParensEmission to skip the extra parens when the parent forces function-call form, or harmonizing the CODEC vs SELECT parser to produce the same shape for not(a, b). I'll prepare that PR (option (a) is the most localized).

F4 — AST fuzzer (arm_asan_ubsan) `Invalid number of rows in Chunk` in `JoiningTransform::transform`

Confirmed STID 3366-2d1d family — same logical error as the 2026-04-23 master hit in BuzzHouse (amd_msan) already tracked on our side, and same family as previously-closed #84198 / #51312 (ScatteredBlock::filter indexing, fixed by @vdimir in #84213 for the parallel_hash thin-key variant). The fuzzer query on this PR is different from the April hit but produces the same column-size mismatch on the output Chunk:

expected 65409, got 130818  (column at position 4, exactly 2× — cross-product duplication not propagated to the other output columns)

60-day CIDB: 2 master hits, ~4 PR hits — rare chronic. We previously attempted local reproduction (~150 iterations across max_threads, max_block_size, parallel_hash, enable_lazy_columns_replication, join_runtime_filter_exact_values_limit) and couldn't trigger it without the full BuzzHouse oracle wrapper. Our note from that investigation: the likely site is ScatteredBlock::filter / appendRightColumns in the offsets.empty() branch when a bloom-filter false positive interacts with multi-slot ConcurrentHashJoin. We're keeping this under CIDB watch and will escalate to @vdimir / @KochetovNicolai / @Algunenano if we get a deterministic repro or hit-rate climbs.

Summary: None caused by this PR. F1 fixed by your #104778 (just needs merge). F3 follow-up fix PR coming from us this week. F2 and F4 stay on the chronic-trunk monitoring list. I'll comment on this PR again only if any of the four reoccurs after #104778 lands.

Address review: previously the hot path in `emplaceKey`/`findKey` checked `derived.can_precompute_hashes` twice (once to enter the block, once after the lazy-init block had a chance to flip it to `false`). The compiler cannot easily fold the two checks because `initPrecomputedHashes` is `NO_INLINE` and may mutate `can_precompute_hashes`. Switch to a single-check pattern: `precomputed_hashes_initialized` starts `true` by default and is set to `false` in the constructor only when we plan to precompute. After the first call it is flipped back to `true` (regardless of whether hashes were actually computed), so subsequent rows only do the one `can_precompute_hashes` check in the hot path. Also fold the `min_bytes_for_prefetch` size-threshold check into `initPrecomputedHashes` itself, which is the natural place for it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The test was removed from `tests/parallel_replicas_blacklist.txt` in ClickHouse#104475 based on a single green run, but it returns rows from `nullable_minmax_index` without `ORDER BY`, and the reference file pins a particular insertion-order output. Under parallel replicas the merging order across shards is not the same as the local read order, so the test fails again on the `amd_llvm_coverage, ParallelReplicas, s3 storage, parallel` configuration. Observed failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=d02f6c889b3cbe833afc358e1828309add8fcca5&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20ParallelReplicas%2C%20s3%20storage%2C%20parallel%29 PR that re-enabled the test: ClickHouse#104475 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fetch-prealloc

`emplaceKey`/`findKey` previously fired `calcPrefetchLookAhead` only at the absolute row `PrefetchingHelper::iterationsToMeasure()`. That is wrong on sliced ranges (e.g. `executeOnBlockSmall` with non-zero `row_begin`): ranges starting after row 100 never calibrate, and ranges with `row_begin != 0` calibrate at the wrong moment. Compute the calibration row from the first row we actually process, matching the `row_begin + iterationsToMeasure` pattern already used in `Aggregator::executeImplBatch`. Addresses review: #104475

clickhouse-gh · 2026-05-14T09:32:53Z

LLVM Coverage Report

Changed lines: 74.24% (49/66) | lost baseline coverage: 1 line(s) · Uncovered code

Full report · Diff report

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 9, 2026

clickhouse-gh Bot reviewed May 9, 2026

View reviewed changes

Comment thread src/Common/ColumnsHashingImpl.h Outdated

alexey-milovidov and others added 2 commits May 9, 2026 23:25

Merge remote-tracking branch 'origin/master' into precompute-hash-pre…

5e0815d

…fetch-prealloc

clickhouse-gh Bot reviewed May 10, 2026

View reviewed changes

Comment thread src/Common/ColumnsHashingImpl.h Outdated

alexey-milovidov and others added 2 commits May 10, 2026 17:48

Merge remote-tracking branch 'origin/master' into precompute-hash-pre…

37354a6

…fetch-prealloc

groeneai mentioned this pull request May 11, 2026

New functions isPrime and isProbablePrime #104234

Merged

1 task

alexey-milovidov mentioned this pull request May 11, 2026

Enable previously-blacklisted tests for parallel replicas #104627

Merged

1 task

murphy-4o mentioned this pull request May 12, 2026

Fix flaky 02504_regexp_dictionary_ua_parser on parallel replicas + s3 #104666

Merged

1 task

Merge remote-tracking branch 'origin/master' into precompute-hash-pre…

a1b0798

…fetch-prealloc

alexey-milovidov mentioned this pull request May 12, 2026

Re-blacklist 01410_nullable_key_and_index for parallel replicas #104679

Merged

1 task

Merge remote-tracking branch 'origin/master' into precompute-hash-pre…

e12e9a5

…fetch-prealloc

Algunenano self-assigned this May 12, 2026

clickhouse-gh Bot reviewed May 12, 2026

View reviewed changes

Comment thread src/Common/ColumnsHashingImpl.h

alexey-milovidov added 2 commits May 13, 2026 00:01

Merge remote-tracking branch 'origin/master' into precompute-hash-pre…

3247608

…fetch-prealloc

Algunenano reviewed May 13, 2026

View reviewed changes

Comment thread src/Common/ColumnsHashingImpl.h Outdated

Merge remote-tracking branch 'origin/master' into precompute-hash-pre…

c75d372

…fetch-prealloc

clickhouse-gh Bot reviewed May 14, 2026

View reviewed changes

Comment thread src/Common/ColumnsHashingImpl.h Outdated

Algunenano approved these changes May 14, 2026

View reviewed changes

alexey-milovidov added this pull request to the merge queue May 14, 2026

Merged via the queue into master with commit 95cea43 May 14, 2026
167 checks passed

alexey-milovidov deleted the precompute-hash-prefetch-prealloc branch May 14, 2026 14:28

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label May 14, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Precompute hash and prefetch for prealloc aggregation variants#104475

Precompute hash and prefetch for prealloc aggregation variants#104475
alexey-milovidov merged 12 commits into
masterfrom
precompute-hash-prefetch-prealloc

alexey-milovidov commented May 9, 2026 •

edited by robot-clickhouse

Loading

Uh oh!

clickhouse-gh Bot commented May 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented May 11, 2026

Uh oh!

Uh oh!

alexey-milovidov commented May 13, 2026

Uh oh!

groeneai commented May 13, 2026

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

alexey-milovidov commented May 9, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Documentation entry for user-facing changes

Why

What changed

Origin

Version info

Uh oh!

clickhouse-gh Bot commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented May 11, 2026

Uh oh!

Uh oh!

alexey-milovidov commented May 13, 2026

Uh oh!

groeneai commented May 13, 2026

F1 — 04221_dictgetordefault_arrayjoin_80703 (BAD_ARGUMENTS)

F2 — FunctionsStress (tsan, function_prop_fuzzer) — writeSlice expects same column types for GenericValueSlice and GenericArraySink

F3 — AST fuzzer (amd_debug) Inconsistent AST formatting between 'ExpressionList' and 'ExpressionList' (tuple/not parens)

F4 — AST fuzzer (arm_asan_ubsan) Invalid number of rows in Chunk in JoiningTransform::transform

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 14, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alexey-milovidov commented May 9, 2026 •

edited by robot-clickhouse

Loading

clickhouse-gh Bot commented May 9, 2026 •

edited

Loading

F1 — `04221_dictgetordefault_arrayjoin_80703` (BAD_ARGUMENTS)

F2 — `FunctionsStress` (tsan, function_prop_fuzzer) — `writeSlice expects same column types for GenericValueSlice and GenericArraySink`

F3 — AST fuzzer (amd_debug) `Inconsistent AST formatting between 'ExpressionList' and 'ExpressionList'` (tuple/not parens)

F4 — AST fuzzer (arm_asan_ubsan) `Invalid number of rows in Chunk` in `JoiningTransform::transform`