Precompute hash and prefetch for prealloc aggregation variants#104475
Conversation
For the `prealloc_serialized` family of aggregation methods, the row sizes are already collected upfront so the keys can be serialized in a single batch pass. This commit takes advantage of that to also compute the per-row hashes ahead of time (`WeakHash32` from each key column), which then unlocks two speedups in the main aggregation loop: 1. **Skip hashing in `emplaceKey`/`findKey`.** The hash is already known, so the call goes through `data.emplace(key_holder, it, inserted, hash_value)` instead of recomputing. A new `emplaceImpl` overload takes the precomputed hash via a `compute_hash` template parameter. 2. **Software prefetch.** Each iteration prefetches the cache line for a future row's bucket using `data.prefetchByHash(hashes[row + look_ahead])`, hiding cache-miss latency on the hash table. The look-ahead distance is calibrated dynamically by `PrefetchingHelper`. For `findKey`, the prefetched cell is also tested via the new `data.isEmptyCell(hash_value)` (works on both `HashTable` and `TwoLevelHashTable`) so missed lookups exit before constructing the key holder. Hash methods opt in by declaring `static constexpr bool has_pre_computed_hashes = true` and providing `hashes`, `prefetching`, `prefetch_look_ahead` members. `HashMethodSerialized<prealloc=true>` opts in here; all others default to `false` via `HashMethodBase` and keep the existing path. `HashMethodSingleLowCardinalityColumn` forwards the flag from its underlying base. Originally part of WIP PR #81944.
The prealloc serialized aggregation path was passing `WeakHash32` values
(computed over source columns) to `data.emplace(..., hash_value)`,
`data.prefetchByHash`, and `data.isEmptyCell`. Hash tables for serialized
aggregation use `DefaultHash<std::string_view>` or `StringViewHash64` over
the *serialized bytes*, so the hash mismatch placed entries in the wrong
bucket and let `findKey` short-circuit when the key actually existed.
Concretely, for `LowCardinality` columns the source `WeakHash32` is taken
over dictionary indexes, so two rows with identical string values but
different positions/indexes produced different hashes and never merged in
the hash table. That broke `test_string_aggregation_compatibility` and
caused wrong results across multi-key string/serialized aggregation.
The fix:
- Replace `WeakHash32 hashes` with `PaddedPODArray<size_t> precomputed_hashes`
holding canonical per-row hashes from `Data::hash(serialized_keys[i])`.
- Defer hash precomputation to the first `emplaceKey`/`findKey` call,
where `Data` (and therefore its hash function) is known.
- Only enable the precomputation when `use_batch_serialize` is true;
otherwise fall back to the regular `data.prefetch(key_holder)` path.
Verified locally that GROUP BY over `(UInt64, LowCardinality(String))`,
multi-key `String` aggregation, and the `findKey` overflow path all
return the expected number of groups.
Addresses the AI review on
#104475 and the
`test_string_aggregation_compatibility` regressions reported in CI:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104475&sha=latest&name_0=PR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sHashingImpl.h` `ColumnsHashingImpl.h` is a high-fanout header in `src/Common`, and the include of `<Interpreters/AggregationCommon.h>` added by the precomputed-hash change did not introduce any usage of symbols from that header — the prefetch code only needs `PrefetchingHelper` from `<Common/HashTable/Prefetching.h>`. Drop the include to avoid unnecessary transitive compile-time cost and the layering leakage of pulling an interpreter-layer header into common hashing code. Addresses review feedback on #104475 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests that ran successfully (`NOT_FAILED` status) in the ParallelReplicas CI shard are no longer broken under parallel replicas and can be enabled by removing them from `tests/parallel_replicas_blacklist.txt`. Removed 67 test names and the comment annotations that referred specifically to those tests. Source: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104475&sha=latest&name_0=PR PR: ClickHouse#104475 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…prefetch` for prealloc serialized The precomputed-hash path added for `HashMethodSerialized<prealloc=true>` turned on `data.prefetchByHash` based purely on `use_batch_serialize`, so `enable_software_prefetch_in_aggregation = 0` no longer disabled software prefetch for `prealloc_serialized`, and the path also bypassed the existing `min_bytes_for_prefetch` small-hashtable guard used by `Aggregator::executeImpl`. Thread both controls through `HashMethodContextSettings`: - `enable_prefetch` (set from `params.enable_prefetch`) gates whether `HashMethodSerialized` enters the precomputed-hash + prefetch path at all. - `min_bytes_for_prefetch` (set from `Aggregator::min_bytes_for_prefetch`) is checked lazily on the first `emplaceKey`/`findKey` once `Data` is known; if the hash table's buffer is below the threshold, fall back to the regular emplace path for the remainder of the batch.
|
@groeneai, the CI failures on this PR look unrelated to the changes (which only touch hash table / aggregation code in
Please investigate these and provide fixes in separate PRs. If any are already in progress, please link them here. I just merged master into the branch, so CI will rerun shortly — if the failures are flaky they may not reproduce. |
|
@alexey-milovidov — investigated all four. All four are pre-existing chronic/trunk bugs unrelated to this PR (and two already have fix PRs in flight). F1 —
|
Address review: previously the hot path in `emplaceKey`/`findKey` checked `derived.can_precompute_hashes` twice (once to enter the block, once after the lazy-init block had a chance to flip it to `false`). The compiler cannot easily fold the two checks because `initPrecomputedHashes` is `NO_INLINE` and may mutate `can_precompute_hashes`. Switch to a single-check pattern: `precomputed_hashes_initialized` starts `true` by default and is set to `false` in the constructor only when we plan to precompute. After the first call it is flipped back to `true` (regardless of whether hashes were actually computed), so subsequent rows only do the one `can_precompute_hashes` check in the hot path. Also fold the `min_bytes_for_prefetch` size-threshold check into `initPrecomputedHashes` itself, which is the natural place for it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test was removed from `tests/parallel_replicas_blacklist.txt` in ClickHouse#104475 based on a single green run, but it returns rows from `nullable_minmax_index` without `ORDER BY`, and the reference file pins a particular insertion-order output. Under parallel replicas the merging order across shards is not the same as the local read order, so the test fails again on the `amd_llvm_coverage, ParallelReplicas, s3 storage, parallel` configuration. Observed failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=d02f6c889b3cbe833afc358e1828309add8fcca5&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20ParallelReplicas%2C%20s3%20storage%2C%20parallel%29 PR that re-enabled the test: ClickHouse#104475 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`emplaceKey`/`findKey` previously fired `calcPrefetchLookAhead` only at the absolute row `PrefetchingHelper::iterationsToMeasure()`. That is wrong on sliced ranges (e.g. `executeOnBlockSmall` with non-zero `row_begin`): ranges starting after row 100 never calibrate, and ranges with `row_begin != 0` calibrate at the wrong moment. Compute the calibration row from the first row we actually process, matching the `row_begin + iterationsToMeasure` pattern already used in `Aggregator::executeImplBatch`. Addresses review: #104475
LLVM Coverage ReportChanged lines: 74.24% (49/66) | lost baseline coverage: 1 line(s) · Uncovered code |


Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
For the
prealloc_serializedfamily of aggregation methods, precompute per-row hashes during the batch-serialization pass and use them to (a) skip rehashing inemplaceKey/findKeyand (b) software-prefetch the next row's bucket. Speeds up multi-key string/serialized aggregation by hiding hash-table cache-miss latency.Documentation entry for user-facing changes
Why
When
HashMethodSerialized<prealloc=true>is chosen for aggregation, the row sizes are already collected up front so keys can be serialized in a single batch. The hash for each row is computed twice in this codepath — once implicitly while the hash table looks up the cell, then again every time we touch that row. We can compute it once during the batch pass and reuse it.That precomputed hash also lets us prefetch the future row's bucket: by the time the loop reaches it, the cache line is warm.
What changed
static constexpr bool has_pre_computed_hashes = trueand providehashes(aWeakHash32),prefetching(PrefetchingHelper), andprefetch_look_ahead.HashMethodSerialized<prealloc=true>opts in.HashMethodSingleLowCardinalityColumnforwards the flag from its base. All other hash methods default tofalseviaHashMethodBase.ColumnsHashingImpl::emplaceKey/findKey: whenhas_pre_computed_hashes, prefetch the bucketlook_aheadrows ahead and skip the rehash on the current row. ForfindKey, also testdata.isEmptyCell(hash_value)to early-exit a miss.emplaceImpl<bool compute_hash>overload chooses betweendata.emplace(...)anddata.emplace(..., hash_value).HashTableexposesprefetchByHash(hash)andisEmptyCell(hash)publicly;TwoLevelHashTableadds matching overloads that route to the right bucket.Origin
Split out from WIP PR #81944. The original commit also tweaked the
PrefetchingHelper::calcPrefetchLookAheadalgorithm; that piece is left out here because master already evolved a different (integer-arithmetic) version of the same calibration and the precomputed-hash path works fine on top of it.Version info
26.5.1.649