Precompute hash and prefetch for prealloc aggregation variants by alexey-milovidov · Pull Request #104475 · ClickHouse/ClickHouse · GitHub
Skip to content

Precompute hash and prefetch for prealloc aggregation variants#104475

Merged
alexey-milovidov merged 12 commits into
masterfrom
precompute-hash-prefetch-prealloc
May 14, 2026
Merged

Precompute hash and prefetch for prealloc aggregation variants#104475
alexey-milovidov merged 12 commits into
masterfrom
precompute-hash-prefetch-prealloc

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 9, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

For the prealloc_serialized family of aggregation methods, precompute per-row hashes during the batch-serialization pass and use them to (a) skip rehashing in emplaceKey/findKey and (b) software-prefetch the next row's bucket. Speeds up multi-key string/serialized aggregation by hiding hash-table cache-miss latency.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Why

When HashMethodSerialized<prealloc=true> is chosen for aggregation, the row sizes are already collected up front so keys can be serialized in a single batch. The hash for each row is computed twice in this codepath — once implicitly while the hash table looks up the cell, then again every time we touch that row. We can compute it once during the batch pass and reuse it.

That precomputed hash also lets us prefetch the future row's bucket: by the time the loop reaches it, the cache line is warm.

What changed

  • Hash methods opt in via static constexpr bool has_pre_computed_hashes = true and provide hashes (a WeakHash32), prefetching (PrefetchingHelper), and prefetch_look_ahead.
  • HashMethodSerialized<prealloc=true> opts in. HashMethodSingleLowCardinalityColumn forwards the flag from its base. All other hash methods default to false via HashMethodBase.
  • ColumnsHashingImpl::emplaceKey/findKey: when has_pre_computed_hashes, prefetch the bucket look_ahead rows ahead and skip the rehash on the current row. For findKey, also test data.isEmptyCell(hash_value) to early-exit a miss.
  • A new emplaceImpl<bool compute_hash> overload chooses between data.emplace(...) and data.emplace(..., hash_value).
  • HashTable exposes prefetchByHash(hash) and isEmptyCell(hash) publicly; TwoLevelHashTable adds matching overloads that route to the right bucket.

Origin

Split out from WIP PR #81944. The original commit also tweaked the PrefetchingHelper::calcPrefetchLookAhead algorithm; that piece is left out here because master already evolved a different (integer-arithmetic) version of the same calibration and the precomputed-hash path works fine on top of it.

Version info

  • Merged into: 26.5.1.649

For the `prealloc_serialized` family of aggregation methods, the row
sizes are already collected upfront so the keys can be serialized in a
single batch pass. This commit takes advantage of that to also compute
the per-row hashes ahead of time (`WeakHash32` from each key column),
which then unlocks two speedups in the main aggregation loop:

1. **Skip hashing in `emplaceKey`/`findKey`.** The hash is already known,
   so the call goes through `data.emplace(key_holder, it, inserted,
   hash_value)` instead of recomputing. A new `emplaceImpl` overload
   takes the precomputed hash via a `compute_hash` template parameter.

2. **Software prefetch.** Each iteration prefetches the cache line for a
   future row's bucket using `data.prefetchByHash(hashes[row +
   look_ahead])`, hiding cache-miss latency on the hash table. The
   look-ahead distance is calibrated dynamically by `PrefetchingHelper`.
   For `findKey`, the prefetched cell is also tested via the new
   `data.isEmptyCell(hash_value)` (works on both `HashTable` and
   `TwoLevelHashTable`) so missed lookups exit before constructing the
   key holder.

Hash methods opt in by declaring `static constexpr bool
has_pre_computed_hashes = true` and providing `hashes`, `prefetching`,
`prefetch_look_ahead` members. `HashMethodSerialized<prealloc=true>`
opts in here; all others default to `false` via `HashMethodBase` and
keep the existing path. `HashMethodSingleLowCardinalityColumn` forwards
the flag from its underlying base.

Originally part of WIP PR #81944.
@clickhouse-gh

clickhouse-gh Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 9, 2026
Comment thread src/Common/ColumnsHashingImpl.h Outdated
alexey-milovidov and others added 2 commits May 9, 2026 23:25
The prealloc serialized aggregation path was passing `WeakHash32` values
(computed over source columns) to `data.emplace(..., hash_value)`,
`data.prefetchByHash`, and `data.isEmptyCell`. Hash tables for serialized
aggregation use `DefaultHash<std::string_view>` or `StringViewHash64` over
the *serialized bytes*, so the hash mismatch placed entries in the wrong
bucket and let `findKey` short-circuit when the key actually existed.

Concretely, for `LowCardinality` columns the source `WeakHash32` is taken
over dictionary indexes, so two rows with identical string values but
different positions/indexes produced different hashes and never merged in
the hash table. That broke `test_string_aggregation_compatibility` and
caused wrong results across multi-key string/serialized aggregation.

The fix:

  - Replace `WeakHash32 hashes` with `PaddedPODArray<size_t> precomputed_hashes`
    holding canonical per-row hashes from `Data::hash(serialized_keys[i])`.
  - Defer hash precomputation to the first `emplaceKey`/`findKey` call,
    where `Data` (and therefore its hash function) is known.
  - Only enable the precomputation when `use_batch_serialize` is true;
    otherwise fall back to the regular `data.prefetch(key_holder)` path.

Verified locally that GROUP BY over `(UInt64, LowCardinality(String))`,
multi-key `String` aggregation, and the `findKey` overflow path all
return the expected number of groups.

Addresses the AI review on
#104475 and the
`test_string_aggregation_compatibility` regressions reported in CI:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104475&sha=latest&name_0=PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/ColumnsHashingImpl.h Outdated
alexey-milovidov and others added 2 commits May 10, 2026 17:48
…sHashingImpl.h`

`ColumnsHashingImpl.h` is a high-fanout header in `src/Common`, and the
include of `<Interpreters/AggregationCommon.h>` added by the precomputed-hash
change did not introduce any usage of symbols from that header — the prefetch
code only needs `PrefetchingHelper` from `<Common/HashTable/Prefetching.h>`.

Drop the include to avoid unnecessary transitive compile-time cost and the
layering leakage of pulling an interpreter-layer header into common hashing
code.

Addresses review feedback on #104475

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

I think this is great:

Screenshot_20260511_195345

huldarchen pushed a commit to huldarchen/ClickHouse that referenced this pull request May 11, 2026
Tests that ran successfully (`NOT_FAILED` status) in the ParallelReplicas CI
shard are no longer broken under parallel replicas and can be enabled by
removing them from `tests/parallel_replicas_blacklist.txt`.

Removed 67 test names and the comment annotations that referred specifically
to those tests.

Source: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104475&sha=latest&name_0=PR
PR: ClickHouse#104475

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Algunenano Algunenano self-assigned this May 12, 2026
Comment thread src/Common/ColumnsHashingImpl.h
…prefetch` for prealloc serialized

The precomputed-hash path added for `HashMethodSerialized<prealloc=true>`
turned on `data.prefetchByHash` based purely on `use_batch_serialize`, so
`enable_software_prefetch_in_aggregation = 0` no longer disabled software
prefetch for `prealloc_serialized`, and the path also bypassed the existing
`min_bytes_for_prefetch` small-hashtable guard used by `Aggregator::executeImpl`.

Thread both controls through `HashMethodContextSettings`:
- `enable_prefetch` (set from `params.enable_prefetch`) gates whether
  `HashMethodSerialized` enters the precomputed-hash + prefetch path at all.
- `min_bytes_for_prefetch` (set from `Aggregator::min_bytes_for_prefetch`)
  is checked lazily on the first `emplaceKey`/`findKey` once `Data` is known;
  if the hash table's buffer is below the threshold, fall back to the
  regular emplace path for the remainder of the batch.
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, the CI failures on this PR look unrelated to the changes (which only touch hash table / aggregation code in src/Common/ColumnsHashing*, src/Common/HashTable/*, and src/Interpreters/Aggregator.cpp):

  1. Stateless test 04221_dictgetordefault_arrayjoin_80703BAD_ARGUMENTS in ExternalDictionariesLoader::resolveDictionaryName. Stack trace is in dictionary resolution and analyzer code. Report.
  2. Unit test FunctionsStress (tsan, function_prop_fuzzer)Logical error: 'Function writeSlice expects same column types for GenericValueSlice and GenericArraySink.'. This is in array function code. Report.
  3. AST fuzzer (amd_debug)Inconsistent AST formatting between 'ExpressionList' and 'ExpressionList' involving extra parentheses around tuple(...) / not(...). Looks like the same class of issue as More inconsistent AST formatting #104605. Report.
  4. AST fuzzer (arm_asan_ubsan)Invalid number of rows in Chunk in JoiningTransform::transform. Report.

Please investigate these and provide fixes in separate PRs. If any are already in progress, please link them here. I just merged master into the branch, so CI will rerun shortly — if the failures are flaky they may not reproduce.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov — investigated all four. All four are pre-existing chronic/trunk bugs unrelated to this PR (and two already have fix PRs in flight).

F1 — 04221_dictgetordefault_arrayjoin_80703 (BAD_ARGUMENTS)

Chronic-flaky on Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel) only. First seen 2026-05-12 (same day the test was added in 23d8c1f "Add test for #80703"). CIDB 30-day count: 21 failures across 18 PRs + 4 master hits, ALL on that one check. The error is Dictionary (tags_80703) not found during parallel-replica execution — the dictionary is created and immediately queried, and remote replicas don't see it in the per-query database scope.

Already in flight: PR #104778 (you, 2026-05-13 02:55 UTC) — Tag 04221_dictgetordefault_arrayjoin_80703 as no-parallel-replicas. MERGEABLE, +3 lines. Will resolve it once merged. No separate fix needed from us.

F2 — FunctionsStress (tsan, function_prop_fuzzer) — writeSlice expects same column types for GenericValueSlice and GenericArraySink

Rare chronic. 30-day CIDB: 3 hits total (this PR, PR #96886 on 2026-05-11, and one master hit 2026-04-17 in BuzzHouse (experimental, serverfuzz, amd_msan) with the GenericArraySlice/Sink variant). 90-day: ~11 hits, all on function_prop_fuzzer or BuzzHouse jobs.

This is a pre-existing limitation of FunctionIf::executeGenericArray (src/Functions/if.cpp:637GatherUtils::conditionalAlgorithms.h:470writeSlice at Algorithms.h:83). The function_prop_fuzzer constructs if(cond, value_of_TypeA, array_of_TypeB) argument shapes that bypass the normal type-unification done by the analyzer, hitting the Same column types assertion. We previously fixed one variant in #103084 (QBit structureEquals); the residual GenericValueSlice case is a different code path (scalar-vs-array, not column-equality). PR #104352 ("Improve function_prop_fuzzer", alexbakharew) was merged recently and likely contributed to its slightly higher visibility, but the underlying engine assertion has been there since at least Feb 2026.

Not actionable on this PR's timeline; needs proper handling in executeGenericArray to wrap mismatched-type "then"/"else" branches before calling conditional. I'll keep tracking via CIDB.

F3 — AST fuzzer (amd_debug) Inconsistent AST formatting between 'ExpressionList' and 'ExpressionList' (tuple/not parens)

Confirmed STID 1941-1bfa, "NOT-tuple subvariant" — same class as #104605 as you noted. Already tracked on our side; root cause and minimal reproducer found yesterday on your directive on PR #104591:

Minimal reproducer (debug build, exit 134):

CREATE TABLE t (c0 Int CODEC(not((not(materialize(1), materialize(2)))), ZSTD)) ENGINE=Memory

The CODEC arg parser produces Function_not(is_operator=true, 2 args) directly, while the SELECT/RoundBrackets path wraps not(a, b) into Function_not(1 arg) → Function_tuple(is_operator=true, 2 args). In CODEC context the formatter has frame.allow_operators=false, which blocks Function_tuple's operator-form (a, b) (ASTFunction.cpp:343), so it falls back to tuple(a, b). On re-parse, that re-parsed tuple(...) gets parenthesized=true from the surrounding (...), and IAST::decideParensEmission adds one extra (...) on the next round-trip → mismatch at executeQuery.cpp:1330.

PR #104626 (already open from us) only covers the lambda-after-comma subvariant of #104605. The NOT/tuple subvariant needs a separate small fix targeting either decideParensEmission to skip the extra parens when the parent forces function-call form, or harmonizing the CODEC vs SELECT parser to produce the same shape for not(a, b). I'll prepare that PR (option (a) is the most localized).

F4 — AST fuzzer (arm_asan_ubsan) Invalid number of rows in Chunk in JoiningTransform::transform

Confirmed STID 3366-2d1d family — same logical error as the 2026-04-23 master hit in BuzzHouse (amd_msan) already tracked on our side, and same family as previously-closed #84198 / #51312 (ScatteredBlock::filter indexing, fixed by @vdimir in #84213 for the parallel_hash thin-key variant). The fuzzer query on this PR is different from the April hit but produces the same column-size mismatch on the output Chunk:

expected 65409, got 130818  (column at position 4, exactly 2× — cross-product duplication not propagated to the other output columns)

60-day CIDB: 2 master hits, ~4 PR hits — rare chronic. We previously attempted local reproduction (~150 iterations across max_threads, max_block_size, parallel_hash, enable_lazy_columns_replication, join_runtime_filter_exact_values_limit) and couldn't trigger it without the full BuzzHouse oracle wrapper. Our note from that investigation: the likely site is ScatteredBlock::filter / appendRightColumns in the offsets.empty() branch when a bloom-filter false positive interacts with multi-slot ConcurrentHashJoin. We're keeping this under CIDB watch and will escalate to @vdimir / @KochetovNicolai / @Algunenano if we get a deterministic repro or hit-rate climbs.


Summary: None caused by this PR. F1 fixed by your #104778 (just needs merge). F3 follow-up fix PR coming from us this week. F2 and F4 stay on the chronic-trunk monitoring list. I'll comment on this PR again only if any of the four reoccurs after #104778 lands.

Comment thread src/Common/ColumnsHashingImpl.h Outdated
Address review: previously the hot path in `emplaceKey`/`findKey` checked
`derived.can_precompute_hashes` twice (once to enter the block, once after the
lazy-init block had a chance to flip it to `false`). The compiler cannot easily
fold the two checks because `initPrecomputedHashes` is `NO_INLINE` and may
mutate `can_precompute_hashes`.

Switch to a single-check pattern: `precomputed_hashes_initialized` starts
`true` by default and is set to `false` in the constructor only when we plan
to precompute. After the first call it is flipped back to `true` (regardless
of whether hashes were actually computed), so subsequent rows only do the one
`can_precompute_hashes` check in the hot path.

Also fold the `min_bytes_for_prefetch` size-threshold check into
`initPrecomputedHashes` itself, which is the natural place for it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pull Bot pushed a commit to Haofei/ClickHouse that referenced this pull request May 13, 2026
The test was removed from `tests/parallel_replicas_blacklist.txt` in
ClickHouse#104475 based on a single
green run, but it returns rows from `nullable_minmax_index` without
`ORDER BY`, and the reference file pins a particular insertion-order
output. Under parallel replicas the merging order across shards is not
the same as the local read order, so the test fails again on the
`amd_llvm_coverage, ParallelReplicas, s3 storage, parallel` configuration.

Observed failure:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=d02f6c889b3cbe833afc358e1828309add8fcca5&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20ParallelReplicas%2C%20s3%20storage%2C%20parallel%29

PR that re-enabled the test: ClickHouse#104475

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/ColumnsHashingImpl.h Outdated
`emplaceKey`/`findKey` previously fired `calcPrefetchLookAhead` only at
the absolute row `PrefetchingHelper::iterationsToMeasure()`. That is
wrong on sliced ranges (e.g. `executeOnBlockSmall` with non-zero
`row_begin`): ranges starting after row 100 never calibrate, and ranges
with `row_begin != 0` calibrate at the wrong moment. Compute the
calibration row from the first row we actually process, matching the
`row_begin + iterationsToMeasure` pattern already used in
`Aggregator::executeImplBatch`.

Addresses review: #104475
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 92.00% 92.00% +0.00%
Branches 76.60% 76.50% -0.10%

Changed lines: 74.24% (49/66) | lost baseline coverage: 1 line(s) · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 14, 2026
Merged via the queue into master with commit 95cea43 May 14, 2026
167 checks passed
@alexey-milovidov alexey-milovidov deleted the precompute-hash-prefetch-prealloc branch May 14, 2026 14:28
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants