Speed up INTERSECT ALL and EXCEPT ALL#107649
Conversation
`IntersectOrExceptTransform` handled the ALL (multiset) variants by hashing every row with `SipHash` into a `UInt128` and using that digest as the key of a `HashMap<UInt128, UInt64>`. This was ~7-10x slower than the DISTINCT path: per-row hashing plus the fat 128-bit-keyed map dominated (large allocations, resizes, page faults), and storing only the digest risked silent collisions corrupting results. Use the same `SetVariants` dispatch the DISTINCT path uses, but with a counting map keyed on the real row value. For fixed and string keys this avoids per-row hashing entirely (the value is the key) and uses a narrower cell; only exotic keys fall back to the previous 128-bit hash. The multiset-count algorithm is unchanged, so semantics are preserved, and keying on real values removes the collision risk. Add `CountingSet` to `SetVariants`: a `HashMap<Key, UInt64>` analogue of the set, with the consecutive-keys cache disabled so the mapped count is incremented in place. On `SELECT count() FROM (numbers INTERSECT numbers)` over 40M rows this drops from ~12s to ~1.6s, matching `INTERSECT DISTINCT`. Add a performance test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ALL (multiset) counting map starts at 256 cells and grows by doubling, so a large right input triggers a long resize cascade: repeated rehashing, re-faulting of growing buffers, and a transient old+new buffer at each step. Profiling shows the high-cardinality case is dominated by this memory traffic, not by hashing. Estimate the right (accumulated) input's row count in the planner by walking its plan and summing source rows (`numbers` limit and `ReadFromMergeTree` selected rows), and pass it through `IntersectOrExceptStep` to the transform. After the first chunk, scale the estimate by the observed distinct ratio and `reserve` the counting map once, collapsing the cascade into a single allocation. Cap the speculative reservation at 1/4 of the remaining query memory budget (hard limit minus current usage), so a wrong over-estimate can never turn a query that would otherwise fit into `MEMORY_LIMIT_EXCEEDED`. The table uses the tracked `HashTableAllocator`, so the reserve is accounted and throws cleanly rather than being OOM-killed. Also start the counting tables at 2^14 cells instead of 256 to skip the first resizes while reading the first block. On `SELECT count() FROM (numbers(20M) INTERSECT numbers(20M))` this lowers peak memory ~33% (1.5 GiB to 1.0 GiB) and roughly halves the time (1.3s to 0.7s), matching the resize-free path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`addToSet` and `buildFilter` are only reachable for the DISTINCT operators; the ALL operators use the counting path. Make that invariant explicit with `chassert(!isAllOperator(), ...)` instead of relying on a comment, so a future dispatch mistake fails loudly in debug builds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous estimator summed leaf source rows, ignoring cardinality-changing steps: e.g. a right operand `SELECT count() FROM numbers(1e9)` produces one row but was estimated at 1e9, reserving the counting map for a billion keys. Walk the operand plan and only derive an estimate when the output cardinality is exact: propagate through `ExpressionStep` (1:1) and `UnionStep` (sum), use `LimitStep` as a hard upper bound (skipping WITH TIES), and read source rows only when no filter is pushed into the source. Any other step (`AggregatingStep`, `FilterStep`, `DistinctStep`, sorts, joins, ...) returns unknown, which skips the pre-reserve. MergeTree is treated as unknown (its selected row count is not available at plan-build time). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two fixes to estimatePlanOutputRows: - LimitStep returned `limit` even when the child estimate was unknown, so `SELECT count() FROM numbers(1e9) LIMIT 1e9` (one row) could still reserve for a billion keys. `limit` bounds the output only if the child has at least `limit + offset` rows; with an unknown child, return unknown. It only ever caps a known child estimate, never raises it. - Replace the `ExpressionStep` special-case with the step's `getTransformTraits().preserves_number_of_rows`. An `ExpressionStep` can contain `arrayJoin`, which is not row-preserving and is already encoded in the traits; this also generalizes to other row-preserving transforms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [173d00c] Summary: ✅
AI ReviewSummaryThis PR replaces the Final VerdictStatus: ✅ Approve |
The row estimate read `ReadFromSystemNumbersStep::getEstimatedRowsCount`
directly from the planner, which under thin-LTO pulled the step's
`storage->as<StorageSystemNumbers>()` (and thus `StorageSystemNumbers`'s
typeinfo) into `clickhouse-keeper`, which does not link storages:
ld.lld: error: undefined symbol: typeinfo for DB::StorageSystemNumbers
Expose the per-source estimate as a virtual `getOutputRowsEstimate` on
`SourceStepWithFilterBase` (default unknown) and call it polymorphically, so
the planner no longer references the concrete numbers step and keeper does not
pull its vtable. The numbers override now also caps the count by a pushed-down
LIMIT, and the planner bails when a row-level or prewhere filter is present.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The pre-reserve was sized once from the first block's distinct ratio times the row estimate. With the default `max_memory_usage = 0` its memory cap was inactive, so a non-injective right operand (e.g. `if(number < 65536, number, 0)` over a large input) whose first block looks all-distinct could reserve for ~1e8 keys though the final multiset holds ~65K, a multi-GiB upfront allocation. Reserve from within `addToCounts` instead, driven by the table's own growth: the per-row cost is just the already-computed inserted flag and a decrement (and is compiled out for the fixed key8/key16 tables that cannot resize). Right before the table would resize, an amortized review reserves straight to the estimate once the live key count is within a factor of it, capped to a quarter of the remaining memory budget. Inputs whose distinct count plateaus below the gate never trigger a reserve and just grow naturally, so the bound holds even with no memory limit set. With no estimate, small growing tables are pre-grown up to a small bound; large ones fall back to the hash table's natural growth. On `numbers(20M) INTERSECT numbers(20M)` this keeps ~0.7s and ~1.25 GiB, while the pathological `if`-collapse drops from 4 GiB to 9 MiB and a low-cardinality `number % 1000000` from 2 GiB to 79 MiB. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Master added `keys32` and `keys64` fixed-width variants to `SetVariants`
(`NonClearableSet`/`ClearableSet` and `APPLY_FOR_SET_VARIANTS`). Since
`CountingSet` is dispatched through the same macro, it must define the matching
members or the shared `init`/`getTotalRowCount`/`getTotalByteCount` expansions
fail to compile against it once this branch is merged with master:
error: no member named 'keys32' in 'DB::CountingSet'
Add the counting (HashMap-backed) `keys32`/`keys64` members, mirroring the
existing `keys128`/`keys256`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nickitat You might be interested in this PR. It introduces, but not universalizes yet, a new way to tweak hash table sizes, in this case using the estimated size of the right table and the available memory to tweak how fast we grow the table. |
nickitat
left a comment
There was a problem hiding this comment.
I think the general idea of replacing SipHash is great, and we should keep this part.
But I'm not a fan of the new resizing logic. Currently, we have 4x resizes until we reach 2^23, then 2x resizes each time. So this estimation-based logic should have no effect for cardinalities up to that threshold, and then it will buy us one resize before the very last one, which will now be skipped. And only if the estimation is available, btw. I'd rather consider re-using the mechanism we already have in aggregation and joins with hashing table sizes after the fact.
I also did a quick experiment - ran your perf test without it:
Results — min of 7–9 runs, pinned to cores 0–7
┌───────────────────────┬────────────────────────────────┬────────────────────────┬───────────────────────────────┐
│ query (distinct keys) │ stable 26.7 (old SipHash path) │ branch ON (reserve on) │ branch OFF (reserve disabled) │
├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
│ key64 (2.5M) │ 1.235 s │ 0.175 s (7.1×) │ 0.236 s (5.2×) │
├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
│ keys128 (2M) │ 0.842 s │ 0.182 s (4.6×) │ 0.179 s (4.7×) │
├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
│ key_string (2M/1M) │ 0.632 s │ 0.194 s (3.3×) │ 0.205 s (3.1×) │
└───────────────────────┴────────────────────────────────┴────────────────────────┴───────────────────────────────┘
"OFF" = the reservation review (reviewCountsReservation / addToCounts gating) compiled out; everything else identical, same toolchain.
Your hypothesis is confirmed
The resizing logic does not deliver the headline speedup. Disabling it costs at most 1.35× (key64), and nothing on keys128/key_string. The 3–7× comes entirely from the keying change (real-value key + narrow cell, no per-row SipHash, no fat UInt128 map) — which
is present in both ON and OFF. OFF is still 5.2× / 4.7× / 3.1× over the old path.
What the 1.35× on key64 actually is — and it's not resizing
It's not resize-skipping: ON and OFF both do the same 5 resizes for 2.5M keys. The difference is the final table geometry:
- OFF: natural ×4 growth overshoots to degree 24 = 16M cells (256 MB).
- ON: reserve(2.5M) fits exactly to degree 23 = 8M cells (128 MB).a
Keep in mind that skipping / reducing the last resize is not only about performance, but about memory. Going from 1 -> 5x memory (1x for old + 4x to hold it) to 1 -> 3x (1x for old + 2x for new), means reducing the max memory usage noticeably for large tables.
I'll have a look |
Per review, the speedup comes entirely from keying the ALL multiset on the real row value instead of a per-row `SipHash128` digest; the estimation-based pre-sizing added complexity and a memory over-reservation footgun for only a marginal, data-dependent gain (final table geometry, not resize-skipping). Revert the on-demand reservation, the query-plan row estimate and its plumbing (`IntersectOrExceptStep` / `ReadFromSystemNumbersStep` / `SourceStepWithFilter` / `Planner`), and the custom counting-set grower. `CountingSet` now uses the default grower, exactly like the DISTINCT path's `NonClearableSet`. Improving INTERSECT/ EXCEPT memory will be handled separately. `SELECT count() FROM (numbers(20M) INTERSECT numbers(20M))`: ~6.8s -> ~0.7s; peak memory 2.25 GiB -> 1.5 GiB (narrower cell, no digest). The INTERSECT/EXCEPT test suite and results are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I've decided to remove the estimation logic for now and simplify the PR. I think there are still cases where it'd be useful to avoid over-reserving memory, but I need to think more about it in general. HashTablesStatistics is useful, but focuses on performance and only for subsequent runs. |
nickitat
left a comment
There was a problem hiding this comment.
LGTM, besides one nitpicking comment
Address review: the SetMethod* structs no longer take a `Mapped` template parameter (their signatures match master). The mapped type is derived from the table - `HashSet` cells report `VoidMapped` (-> void), `HashMap` cells a real type - and the consecutive-keys cache is disabled when there is a mapped value. `CountingSet` mirrors `NonClearableSet` with `HashMap`s; a method stays a set when given a `HashSet` and becomes a counting multiset when given a `HashMap`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
📊 Cloud Performance Report ✅ AI verdict: This PR rewrites only the INTERSECT ALL / EXCEPT ALL execution path, replacing a SipHash-based occurrence-count map with a typed multiset; the SetVariants edits are template-level no-ops for the existing presence-only sets. ClickBench contains no INTERSECT/EXCEPT queries, so neither flagged improvement (Q4 -15.27%, Q15 -20.16%) can be caused by this change. Both deltas are below the trust-override threshold and are best read as run-to-run variance, so both have been downgraded to not_sure. No real per-query regressions or improvements attributable to this PR were found on either benchmark. clickbenchFlagged queries (2 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🟢 No significant changes Debug info
|
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered by tests: 53/59 (89.83%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 2 line(s) · Uncovered code |
af19657



Related: #96876
INTERSECT/EXCEPTdefault toALLmode. TheALL(multiset) path inIntersectOrExceptTransformbuilt the right-side multiset by hashing every rowwith
SipHashinto aUInt128and using that digest as the key of aHashMap<UInt128, UInt64>. This was several times slower than the correspondingDISTINCTpath (per-row hashing plus a fat 128-bit-keyed map that allocates,resizes and page-faults heavily), and storing only the digest risked silently
returning wrong results on a 128-bit hash collision.
This reuses the same
SetVariantsdispatch theDISTINCTpath already uses, butwith a counting hash map keyed on the real row value. For numeric and string keys
there is no per-row hashing at all (the value is the key) and the cell is
narrower; only exotic keys fall back to the previous 128-bit hashing. The
multiset-count algorithm is unchanged, so results are identical.
The counting map is pre-sized on demand as it fills: an amortized review (armed to
the table's own resize points) reserves toward an estimate of the right operand's
row count once the live key count is within reach of it, capped to a quarter of
the remaining query memory budget. A right side whose distinct count stays small
(a non-injective expression, heavy duplicates, an aggregate) never reaches that
point and just grows naturally, so the reservation can never over-allocate for it,
with or without a memory limit set.
Measured locally (
SELECT count() FROM (a INTERSECT ALL a), 20M rows per operand):UInt64UInt64columnsString(EXCEPT ALL)The same shape at 40M rows per operand (
SELECT count() FROM (SELECT number FROM numbers(40000000) INTERSECT SELECT number FROM numbers(40000000))) drops from~12 s to ~1.4 s.
For
Stringkeys the multiset now stores the actual string rather than a 16-bytedigest. That is what makes the result provably correct (no hash-collision risk),
it is still several times faster, and the memory cost is small and bounded.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Improve the performance of `INTERSECT ALL` and `EXCEPT ALL` (the default mode for `INTERSECT` and `EXCEPT`) by several times, by keying the multiset on the row value instead of hashing each row with SipHash.
Version info
26.7.1.261