Speed up INTERSECT ALL and EXCEPT ALL by Algunenano · Pull Request #107649 · ClickHouse/ClickHouse · GitHub
Skip to content

Speed up INTERSECT ALL and EXCEPT ALL#107649

Merged
Algunenano merged 13 commits into
ClickHouse:masterfrom
Algunenano:intersect-all-counting-perf
Jun 29, 2026
Merged

Speed up INTERSECT ALL and EXCEPT ALL#107649
Algunenano merged 13 commits into
ClickHouse:masterfrom
Algunenano:intersect-all-counting-perf

Conversation

@Algunenano

@Algunenano Algunenano commented Jun 16, 2026

Copy link
Copy Markdown
Member

Related: #96876

INTERSECT/EXCEPT default to ALL mode. The ALL (multiset) path in
IntersectOrExceptTransform built the right-side multiset by hashing every row
with SipHash into a UInt128 and using that digest as the key of a
HashMap<UInt128, UInt64>. This was several times slower than the corresponding
DISTINCT path (per-row hashing plus a fat 128-bit-keyed map that allocates,
resizes and page-faults heavily), and storing only the digest risked silently
returning wrong results on a 128-bit hash collision.

This reuses the same SetVariants dispatch the DISTINCT path already uses, but
with a counting hash map keyed on the real row value. For numeric and string keys
there is no per-row hashing at all (the value is the key) and the cell is
narrower; only exotic keys fall back to the previous 128-bit hashing. The
multiset-count algorithm is unchanged, so results are identical.

The counting map is pre-sized on demand as it fills: an amortized review (armed to
the table's own resize points) reserves toward an estimate of the right operand's
row count once the live key count is within reach of it, capped to a quarter of
the remaining query memory budget. A right side whose distinct count stays small
(a non-injective expression, heavy duplicates, an aggregate) never reaches that
point and just grows naturally, so the reservation can never over-allocate for it,
with or without a memory limit set.

Measured locally (SELECT count() FROM (a INTERSECT ALL a), 20M rows per operand):

Right operand key master this PR
UInt64 6.8 s, 2.25 GiB 0.7 s, 1.25 GiB
two UInt64 columns 6.7 s, 2.25 GiB 1.4 s, 1.88 GiB
String (EXCEPT ALL) 2.1 s, 483 MiB 0.6 s, 659 MiB

The same shape at 40M rows per operand (SELECT count() FROM (SELECT number FROM numbers(40000000) INTERSECT SELECT number FROM numbers(40000000))) drops from
~12 s to ~1.4 s.

For String keys the multiset now stores the actual string rather than a 16-byte
digest. That is what makes the result provably correct (no hash-collision risk),
it is still several times faster, and the memory cost is small and bounded.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Improve the performance of `INTERSECT ALL` and `EXCEPT ALL` (the default mode for `INTERSECT` and `EXCEPT`) by several times, by keying the multiset on the row value instead of hashing each row with SipHash.

Version info

  • Merged into: 26.7.1.261

Algunenano and others added 5 commits June 16, 2026 17:37
`IntersectOrExceptTransform` handled the ALL (multiset) variants by
hashing every row with `SipHash` into a `UInt128` and using that digest
as the key of a `HashMap<UInt128, UInt64>`. This was ~7-10x slower than
the DISTINCT path: per-row hashing plus the fat 128-bit-keyed map
dominated (large allocations, resizes, page faults), and storing only
the digest risked silent collisions corrupting results.

Use the same `SetVariants` dispatch the DISTINCT path uses, but with a
counting map keyed on the real row value. For fixed and string keys this
avoids per-row hashing entirely (the value is the key) and uses a
narrower cell; only exotic keys fall back to the previous 128-bit hash.
The multiset-count algorithm is unchanged, so semantics are preserved,
and keying on real values removes the collision risk.

Add `CountingSet` to `SetVariants`: a `HashMap<Key, UInt64>` analogue of
the set, with the consecutive-keys cache disabled so the mapped count is
incremented in place.

On `SELECT count() FROM (numbers INTERSECT numbers)` over 40M rows this
drops from ~12s to ~1.6s, matching `INTERSECT DISTINCT`.

Add a performance test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ALL (multiset) counting map starts at 256 cells and grows by doubling,
so a large right input triggers a long resize cascade: repeated rehashing,
re-faulting of growing buffers, and a transient old+new buffer at each step.
Profiling shows the high-cardinality case is dominated by this memory
traffic, not by hashing.

Estimate the right (accumulated) input's row count in the planner by
walking its plan and summing source rows (`numbers` limit and
`ReadFromMergeTree` selected rows), and pass it through
`IntersectOrExceptStep` to the transform. After the first chunk, scale the
estimate by the observed distinct ratio and `reserve` the counting map once,
collapsing the cascade into a single allocation.

Cap the speculative reservation at 1/4 of the remaining query memory budget
(hard limit minus current usage), so a wrong over-estimate can never turn a
query that would otherwise fit into `MEMORY_LIMIT_EXCEEDED`. The table uses
the tracked `HashTableAllocator`, so the reserve is accounted and throws
cleanly rather than being OOM-killed. Also start the counting tables at
2^14 cells instead of 256 to skip the first resizes while reading the first
block.

On `SELECT count() FROM (numbers(20M) INTERSECT numbers(20M))` this lowers
peak memory ~33% (1.5 GiB to 1.0 GiB) and roughly halves the time
(1.3s to 0.7s), matching the resize-free path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`addToSet` and `buildFilter` are only reachable for the DISTINCT operators;
the ALL operators use the counting path. Make that invariant explicit with
`chassert(!isAllOperator(), ...)` instead of relying on a comment, so a future
dispatch mistake fails loudly in debug builds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous estimator summed leaf source rows, ignoring cardinality-changing
steps: e.g. a right operand `SELECT count() FROM numbers(1e9)` produces one
row but was estimated at 1e9, reserving the counting map for a billion keys.

Walk the operand plan and only derive an estimate when the output cardinality
is exact: propagate through `ExpressionStep` (1:1) and `UnionStep` (sum), use
`LimitStep` as a hard upper bound (skipping WITH TIES), and read source rows
only when no filter is pushed into the source. Any other step
(`AggregatingStep`, `FilterStep`, `DistinctStep`, sorts, joins, ...) returns
unknown, which skips the pre-reserve. MergeTree is treated as unknown (its
selected row count is not available at plan-build time).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two fixes to estimatePlanOutputRows:

- LimitStep returned `limit` even when the child estimate was unknown, so
  `SELECT count() FROM numbers(1e9) LIMIT 1e9` (one row) could still reserve
  for a billion keys. `limit` bounds the output only if the child has at
  least `limit + offset` rows; with an unknown child, return unknown. It only
  ever caps a known child estimate, never raises it.

- Replace the `ExpressionStep` special-case with the step's
  `getTransformTraits().preserves_number_of_rows`. An `ExpressionStep` can
  contain `arrayJoin`, which is not row-preserving and is already encoded in
  the traits; this also generalizes to other row-preserving transforms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [173d00c]

Summary:


AI Review

Summary

This PR replaces the INTERSECT ALL / EXCEPT ALL multiset implementation with CountingSetVariants, using the existing SetVariants key-method dispatch for the common numeric, fixed-size, and string key paths while preserving the previous hashed fallback for other keys. I did not find a new correctness, safety, or performance issue in the current diff.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Jun 16, 2026
Comment thread src/Processors/Transforms/IntersectOrExceptTransform.cpp Outdated
Algunenano and others added 5 commits June 17, 2026 12:31
The row estimate read `ReadFromSystemNumbersStep::getEstimatedRowsCount`
directly from the planner, which under thin-LTO pulled the step's
`storage->as<StorageSystemNumbers>()` (and thus `StorageSystemNumbers`'s
typeinfo) into `clickhouse-keeper`, which does not link storages:

    ld.lld: error: undefined symbol: typeinfo for DB::StorageSystemNumbers

Expose the per-source estimate as a virtual `getOutputRowsEstimate` on
`SourceStepWithFilterBase` (default unknown) and call it polymorphically, so
the planner no longer references the concrete numbers step and keeper does not
pull its vtable. The numbers override now also caps the count by a pushed-down
LIMIT, and the planner bails when a row-level or prewhere filter is present.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The pre-reserve was sized once from the first block's distinct ratio times the
row estimate. With the default `max_memory_usage = 0` its memory cap was
inactive, so a non-injective right operand (e.g. `if(number < 65536, number, 0)`
over a large input) whose first block looks all-distinct could reserve for ~1e8
keys though the final multiset holds ~65K, a multi-GiB upfront allocation.

Reserve from within `addToCounts` instead, driven by the table's own growth: the
per-row cost is just the already-computed inserted flag and a decrement (and is
compiled out for the fixed key8/key16 tables that cannot resize). Right before
the table would resize, an amortized review reserves straight to the estimate
once the live key count is within a factor of it, capped to a quarter of the
remaining memory budget. Inputs whose distinct count plateaus below the gate
never trigger a reserve and just grow naturally, so the bound holds even with no
memory limit set. With no estimate, small growing tables are pre-grown up to a
small bound; large ones fall back to the hash table's natural growth.

On `numbers(20M) INTERSECT numbers(20M)` this keeps ~0.7s and ~1.25 GiB, while
the pathological `if`-collapse drops from 4 GiB to 9 MiB and a low-cardinality
`number % 1000000` from 2 GiB to 79 MiB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Master added `keys32` and `keys64` fixed-width variants to `SetVariants`
(`NonClearableSet`/`ClearableSet` and `APPLY_FOR_SET_VARIANTS`). Since
`CountingSet` is dispatched through the same macro, it must define the matching
members or the shared `init`/`getTotalRowCount`/`getTotalByteCount` expansions
fail to compile against it once this branch is merged with master:

    error: no member named 'keys32' in 'DB::CountingSet'

Add the counting (HashMap-backed) `keys32`/`keys64` members, mirroring the
existing `keys128`/`keys256`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Algunenano

Copy link
Copy Markdown
Member Author
image image

@nickitat You might be interested in this PR. It introduces, but not universalizes yet, a new way to tweak hash table sizes, in this case using the estimated size of the right table and the available memory to tweak how fast we grow the table.

@nickitat nickitat self-assigned this Jun 18, 2026

@nickitat nickitat left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the general idea of replacing SipHash is great, and we should keep this part.
But I'm not a fan of the new resizing logic. Currently, we have 4x resizes until we reach 2^23, then 2x resizes each time. So this estimation-based logic should have no effect for cardinalities up to that threshold, and then it will buy us one resize before the very last one, which will now be skipped. And only if the estimation is available, btw. I'd rather consider re-using the mechanism we already have in aggregation and joins with hashing table sizes after the fact.
I also did a quick experiment - ran your perf test without it:

  Results — min of 7–9 runs, pinned to cores 0–7

  ┌───────────────────────┬────────────────────────────────┬────────────────────────┬───────────────────────────────┐
  │ query (distinct keys) │ stable 26.7 (old SipHash path) │ branch ON (reserve on) │ branch OFF (reserve disabled) │
  ├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
  │ key64 (2.5M)          │ 1.235 s                        │ 0.175 s (7.1×)         │ 0.236 s (5.2×)                │
  ├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
  │ keys128 (2M)          │ 0.842 s                        │ 0.182 s (4.6×)         │ 0.179 s (4.7×)                │
  ├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
  │ key_string (2M/1M)    │ 0.632 s                        │ 0.194 s (3.3×)         │ 0.205 s (3.1×)                │
  └───────────────────────┴────────────────────────────────┴────────────────────────┴───────────────────────────────┘

  "OFF" = the reservation review (reviewCountsReservation / addToCounts gating) compiled out; everything else identical, same toolchain.

  Your hypothesis is confirmed

  The resizing logic does not deliver the headline speedup. Disabling it costs at most 1.35× (key64), and nothing on keys128/key_string. The 3–7× comes entirely from the keying change (real-value key + narrow cell, no per-row SipHash, no fat UInt128 map) — which
  is present in both ON and OFF. OFF is still 5.2× / 4.7× / 3.1× over the old path.

  What the 1.35× on key64 actually is — and it's not resizing

  It's not resize-skipping: ON and OFF both do the same 5 resizes for 2.5M keys. The difference is the final table geometry:

  - OFF: natural ×4 growth overshoots to degree 24 = 16M cells (256 MB).
  - ON: reserve(2.5M) fits exactly to degree 23 = 8M cells (128 MB).a

@Algunenano

Copy link
Copy Markdown
Member Author

I also did a quick experiment - ran your perf test without it:

Keep in mind that skipping / reducing the last resize is not only about performance, but about memory. Going from 1 -> 5x memory (1x for old + 4x to hold it) to 1 -> 3x (1x for old + 2x for new), means reducing the max memory usage noticeably for large tables.

I'd rather consider re-using the mechanism we already have in aggregation and joins with hashing table sizes after the fact.

I'll have a look

Algunenano and others added 2 commits June 26, 2026 13:10
Per review, the speedup comes entirely from keying the ALL multiset on the real
row value instead of a per-row `SipHash128` digest; the estimation-based
pre-sizing added complexity and a memory over-reservation footgun for only a
marginal, data-dependent gain (final table geometry, not resize-skipping).

Revert the on-demand reservation, the query-plan row estimate and its plumbing
(`IntersectOrExceptStep` / `ReadFromSystemNumbersStep` / `SourceStepWithFilter` /
`Planner`), and the custom counting-set grower. `CountingSet` now uses the default
grower, exactly like the DISTINCT path's `NonClearableSet`. Improving INTERSECT/
EXCEPT memory will be handled separately.

`SELECT count() FROM (numbers(20M) INTERSECT numbers(20M))`: ~6.8s -> ~0.7s; peak
memory 2.25 GiB -> 1.5 GiB (narrower cell, no digest). The INTERSECT/EXCEPT test
suite and results are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Algunenano

Copy link
Copy Markdown
Member Author

I've decided to remove the estimation logic for now and simplify the PR. I think there are still cases where it'd be useful to avoid over-reserving memory, but I need to think more about it in general. HashTablesStatistics is useful, but focuses on performance and only for subsequent runs.

@nickitat nickitat left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, besides one nitpicking comment

Comment thread src/Interpreters/SetVariants.h Outdated
Address review: the SetMethod* structs no longer take a `Mapped` template
parameter (their signatures match master). The mapped type is derived from the
table - `HashSet` cells report `VoidMapped` (-> void), `HashMap` cells a real
type - and the consecutive-keys cache is disabled when there is a mapped value.
`CountingSet` mirrors `NonClearableSet` with `HashMap`s; a method stays a set
when given a `HashSet` and becomes a counting multiset when given a `HashMap`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR rewrites only the INTERSECT ALL / EXCEPT ALL execution path, replacing a SipHash-based occurrence-count map with a typed multiset; the SetVariants edits are template-level no-ops for the existing presence-only sets. ClickBench contains no INTERSECT/EXCEPT queries, so neither flagged improvement (Q4 -15.27%, Q15 -20.16%) can be caused by this change. Both deltas are below the trust-override threshold and are best read as run-to-run variance, so both have been downgraded to not_sure. No real per-query regressions or improvements attributable to this PR were found on either benchmark.

clickbench

⚠️ 2 inconclusive

Flagged queries (2 of 43)
Query Verdict Baseline median (ms) PR median (ms) Change q-value Hint
⚠️ 4 not_sure 262 222 -15.3% <0.0001 PR only changes INTERSECT ALL/EXCEPT ALL counting path; ClickBench has no such queries, so -15% is variance
⚠️ 15 not_sure 248 198 -20.2% <0.0001 Off-path: PR touches only INTERSECT/EXCEPT ALL transform; Q15 doesn't exercise it, -20% is run-to-run variance

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: d6f030af-4702-41ec-88a9-fe1ef61b4c1a
  • MIRAI run: b27178c7-f3de-4f9d-9837-87ef56012e8c
  • PR check IDs:
    • clickbench_162347_1782491652
    • clickbench_162353_1782491652
    • clickbench_162363_1782491652
    • tpch_adapted_1_official_162371_1782491652
    • tpch_adapted_1_official_162385_1782491652
    • tpch_adapted_1_official_162397_1782491652

@clickhouse-gh

clickhouse-gh Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.30% 85.30% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.50% 77.60% +0.10%

Changed lines: Changed C/C++ lines covered by tests: 53/59 (89.83%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 2 line(s) · Uncovered code

Full report · Diff report

@Algunenano Algunenano added this pull request to the merge queue Jun 29, 2026
Merged via the queue into ClickHouse:master with commit af19657 Jun 29, 2026
327 of 330 checks passed
@Algunenano Algunenano deleted the intersect-all-counting-perf branch June 29, 2026 18:11
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants