iframe-proxy

Algunenano · 2026-06-16T16:55:27Z

Related: #96876

INTERSECT/EXCEPT default to ALL mode. The ALL (multiset) path in
IntersectOrExceptTransform built the right-side multiset by hashing every row
with SipHash into a UInt128 and using that digest as the key of a
HashMap<UInt128, UInt64>. This was several times slower than the corresponding
DISTINCT path (per-row hashing plus a fat 128-bit-keyed map that allocates,
resizes and page-faults heavily), and storing only the digest risked silently
returning wrong results on a 128-bit hash collision.

This reuses the same SetVariants dispatch the DISTINCT path already uses, but
with a counting hash map keyed on the real row value. For numeric and string keys
there is no per-row hashing at all (the value is the key) and the cell is
narrower; only exotic keys fall back to the previous 128-bit hashing. The
multiset-count algorithm is unchanged, so results are identical.

The counting map is pre-sized on demand as it fills: an amortized review (armed to
the table's own resize points) reserves toward an estimate of the right operand's
row count once the live key count is within reach of it, capped to a quarter of
the remaining query memory budget. A right side whose distinct count stays small
(a non-injective expression, heavy duplicates, an aggregate) never reaches that
point and just grows naturally, so the reservation can never over-allocate for it,
with or without a memory limit set.

Measured locally (SELECT count() FROM (a INTERSECT ALL a), 20M rows per operand):

The same shape at 40M rows per operand (SELECT count() FROM (SELECT number FROM numbers(40000000) INTERSECT SELECT number FROM numbers(40000000))) drops from
~12 s to ~1.4 s.

For String keys the multiset now stores the actual string rather than a 16-byte
digest. That is what makes the result provably correct (no hash-collision risk),
it is still several times faster, and the memory cost is small and bounded.

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Improve the performance of `INTERSECT ALL` and `EXCEPT ALL` (the default mode for `INTERSECT` and `EXCEPT`) by several times, by keying the multiset on the row value instead of hashing each row with SipHash.

Version info

Merged into: 26.7.1.261

`IntersectOrExceptTransform` handled the ALL (multiset) variants by hashing every row with `SipHash` into a `UInt128` and using that digest as the key of a `HashMap<UInt128, UInt64>`. This was ~7-10x slower than the DISTINCT path: per-row hashing plus the fat 128-bit-keyed map dominated (large allocations, resizes, page faults), and storing only the digest risked silent collisions corrupting results. Use the same `SetVariants` dispatch the DISTINCT path uses, but with a counting map keyed on the real row value. For fixed and string keys this avoids per-row hashing entirely (the value is the key) and uses a narrower cell; only exotic keys fall back to the previous 128-bit hash. The multiset-count algorithm is unchanged, so semantics are preserved, and keying on real values removes the collision risk. Add `CountingSet` to `SetVariants`: a `HashMap<Key, UInt64>` analogue of the set, with the consecutive-keys cache disabled so the mapped count is incremented in place. On `SELECT count() FROM (numbers INTERSECT numbers)` over 40M rows this drops from ~12s to ~1.6s, matching `INTERSECT DISTINCT`. Add a performance test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The ALL (multiset) counting map starts at 256 cells and grows by doubling, so a large right input triggers a long resize cascade: repeated rehashing, re-faulting of growing buffers, and a transient old+new buffer at each step. Profiling shows the high-cardinality case is dominated by this memory traffic, not by hashing. Estimate the right (accumulated) input's row count in the planner by walking its plan and summing source rows (`numbers` limit and `ReadFromMergeTree` selected rows), and pass it through `IntersectOrExceptStep` to the transform. After the first chunk, scale the estimate by the observed distinct ratio and `reserve` the counting map once, collapsing the cascade into a single allocation. Cap the speculative reservation at 1/4 of the remaining query memory budget (hard limit minus current usage), so a wrong over-estimate can never turn a query that would otherwise fit into `MEMORY_LIMIT_EXCEEDED`. The table uses the tracked `HashTableAllocator`, so the reserve is accounted and throws cleanly rather than being OOM-killed. Also start the counting tables at 2^14 cells instead of 256 to skip the first resizes while reading the first block. On `SELECT count() FROM (numbers(20M) INTERSECT numbers(20M))` this lowers peak memory ~33% (1.5 GiB to 1.0 GiB) and roughly halves the time (1.3s to 0.7s), matching the resize-free path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`addToSet` and `buildFilter` are only reachable for the DISTINCT operators; the ALL operators use the counting path. Make that invariant explicit with `chassert(!isAllOperator(), ...)` instead of relying on a comment, so a future dispatch mistake fails loudly in debug builds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The previous estimator summed leaf source rows, ignoring cardinality-changing steps: e.g. a right operand `SELECT count() FROM numbers(1e9)` produces one row but was estimated at 1e9, reserving the counting map for a billion keys. Walk the operand plan and only derive an estimate when the output cardinality is exact: propagate through `ExpressionStep` (1:1) and `UnionStep` (sum), use `LimitStep` as a hard upper bound (skipping WITH TIES), and read source rows only when no filter is pushed into the source. Any other step (`AggregatingStep`, `FilterStep`, `DistinctStep`, sorts, joins, ...) returns unknown, which skips the pre-reserve. MergeTree is treated as unknown (its selected row count is not available at plan-build time). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two fixes to estimatePlanOutputRows: - LimitStep returned `limit` even when the child estimate was unknown, so `SELECT count() FROM numbers(1e9) LIMIT 1e9` (one row) could still reserve for a billion keys. `limit` bounds the output only if the child has at least `limit + offset` rows; with an unknown child, return unknown. It only ever caps a known child estimate, never raises it. - Replace the `ExpressionStep` special-case with the step's `getTransformTraits().preserves_number_of_rows`. An `ExpressionStep` can contain `arrayJoin`, which is not row-preserving and is already encoded in the traits; this also generalizes to other row-preserving transforms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-06-16T16:56:03Z

Workflow [PR], commit [173d00c]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR replaces the INTERSECT ALL / EXCEPT ALL multiset implementation with CountingSetVariants, using the existing SetVariants key-method dispatch for the common numeric, fixed-size, and string key paths while preserving the previous hashed fallback for other keys. I did not find a new correctness, safety, or performance issue in the current diff.

Final Verdict

Status: ✅ Approve

…ting-perf

The row estimate read `ReadFromSystemNumbersStep::getEstimatedRowsCount` directly from the planner, which under thin-LTO pulled the step's `storage->as<StorageSystemNumbers>()` (and thus `StorageSystemNumbers`'s typeinfo) into `clickhouse-keeper`, which does not link storages: ld.lld: error: undefined symbol: typeinfo for DB::StorageSystemNumbers Expose the per-source estimate as a virtual `getOutputRowsEstimate` on `SourceStepWithFilterBase` (default unknown) and call it polymorphically, so the planner no longer references the concrete numbers step and keeper does not pull its vtable. The numbers override now also caps the count by a pushed-down LIMIT, and the planner bails when a row-level or prewhere filter is present. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The pre-reserve was sized once from the first block's distinct ratio times the row estimate. With the default `max_memory_usage = 0` its memory cap was inactive, so a non-injective right operand (e.g. `if(number < 65536, number, 0)` over a large input) whose first block looks all-distinct could reserve for ~1e8 keys though the final multiset holds ~65K, a multi-GiB upfront allocation. Reserve from within `addToCounts` instead, driven by the table's own growth: the per-row cost is just the already-computed inserted flag and a decrement (and is compiled out for the fixed key8/key16 tables that cannot resize). Right before the table would resize, an amortized review reserves straight to the estimate once the live key count is within a factor of it, capped to a quarter of the remaining memory budget. Inputs whose distinct count plateaus below the gate never trigger a reserve and just grow naturally, so the bound holds even with no memory limit set. With no estimate, small growing tables are pre-grown up to a small bound; large ones fall back to the hash table's natural growth. On `numbers(20M) INTERSECT numbers(20M)` this keeps ~0.7s and ~1.25 GiB, while the pathological `if`-collapse drops from 4 GiB to 9 MiB and a low-cardinality `number % 1000000` from 2 GiB to 79 MiB. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Master added `keys32` and `keys64` fixed-width variants to `SetVariants` (`NonClearableSet`/`ClearableSet` and `APPLY_FOR_SET_VARIANTS`). Since `CountingSet` is dispatched through the same macro, it must define the matching members or the shared `init`/`getTotalRowCount`/`getTotalByteCount` expansions fail to compile against it once this branch is merged with master: error: no member named 'keys32' in 'DB::CountingSet' Add the counting (HashMap-backed) `keys32`/`keys64` members, mirroring the existing `keys128`/`keys256`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ting-perf

Algunenano · 2026-06-18T08:17:14Z

@nickitat You might be interested in this PR. It introduces, but not universalizes yet, a new way to tweak hash table sizes, in this case using the estimated size of the right table and the available memory to tweak how fast we grow the table.

nickitat

I think the general idea of replacing SipHash is great, and we should keep this part.
But I'm not a fan of the new resizing logic. Currently, we have 4x resizes until we reach 2^23, then 2x resizes each time. So this estimation-based logic should have no effect for cardinalities up to that threshold, and then it will buy us one resize before the very last one, which will now be skipped. And only if the estimation is available, btw. I'd rather consider re-using the mechanism we already have in aggregation and joins with hashing table sizes after the fact.
I also did a quick experiment - ran your perf test without it:

  Results — min of 7–9 runs, pinned to cores 0–7

  ┌───────────────────────┬────────────────────────────────┬────────────────────────┬───────────────────────────────┐
  │ query (distinct keys) │ stable 26.7 (old SipHash path) │ branch ON (reserve on) │ branch OFF (reserve disabled) │
  ├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
  │ key64 (2.5M)          │ 1.235 s                        │ 0.175 s (7.1×)         │ 0.236 s (5.2×)                │
  ├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
  │ keys128 (2M)          │ 0.842 s                        │ 0.182 s (4.6×)         │ 0.179 s (4.7×)                │
  ├───────────────────────┼────────────────────────────────┼────────────────────────┼───────────────────────────────┤
  │ key_string (2M/1M)    │ 0.632 s                        │ 0.194 s (3.3×)         │ 0.205 s (3.1×)                │
  └───────────────────────┴────────────────────────────────┴────────────────────────┴───────────────────────────────┘

  "OFF" = the reservation review (reviewCountsReservation / addToCounts gating) compiled out; everything else identical, same toolchain.

  Your hypothesis is confirmed

  The resizing logic does not deliver the headline speedup. Disabling it costs at most 1.35× (key64), and nothing on keys128/key_string. The 3–7× comes entirely from the keying change (real-value key + narrow cell, no per-row SipHash, no fat UInt128 map) — which
  is present in both ON and OFF. OFF is still 5.2× / 4.7× / 3.1× over the old path.

  What the 1.35× on key64 actually is — and it's not resizing

  It's not resize-skipping: ON and OFF both do the same 5 resizes for 2.5M keys. The difference is the final table geometry:

  - OFF: natural ×4 growth overshoots to degree 24 = 16M cells (256 MB).
  - ON: reserve(2.5M) fits exactly to degree 23 = 8M cells (128 MB).a

Algunenano · 2026-06-24T17:13:22Z

I also did a quick experiment - ran your perf test without it:

Keep in mind that skipping / reducing the last resize is not only about performance, but about memory. Going from 1 -> 5x memory (1x for old + 4x to hold it) to 1 -> 3x (1x for old + 2x for new), means reducing the max memory usage noticeably for large tables.

I'd rather consider re-using the mechanism we already have in aggregation and joins with hashing table sizes after the fact.

I'll have a look

…ting-perf

Per review, the speedup comes entirely from keying the ALL multiset on the real row value instead of a per-row `SipHash128` digest; the estimation-based pre-sizing added complexity and a memory over-reservation footgun for only a marginal, data-dependent gain (final table geometry, not resize-skipping). Revert the on-demand reservation, the query-plan row estimate and its plumbing (`IntersectOrExceptStep` / `ReadFromSystemNumbersStep` / `SourceStepWithFilter` / `Planner`), and the custom counting-set grower. `CountingSet` now uses the default grower, exactly like the DISTINCT path's `NonClearableSet`. Improving INTERSECT/ EXCEPT memory will be handled separately. `SELECT count() FROM (numbers(20M) INTERSECT numbers(20M))`: ~6.8s -> ~0.7s; peak memory 2.25 GiB -> 1.5 GiB (narrower cell, no digest). The INTERSECT/EXCEPT test suite and results are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Algunenano · 2026-06-26T11:35:57Z

I've decided to remove the estimation logic for now and simplify the PR. I think there are still cases where it'd be useful to avoid over-reserving memory, but I need to think more about it in general. HashTablesStatistics is useful, but focuses on performance and only for subsequent runs.

nickitat

LGTM, besides one nitpicking comment

Address review: the SetMethod* structs no longer take a `Mapped` template parameter (their signatures match master). The mapped type is derived from the table - `HashSet` cells report `VoidMapped` (-> void), `HashMap` cells a real type - and the consecutive-keys cache is disabled when there is a mapped value. `CountingSet` mirrors `NonClearableSet` with `HashMap`s; a method stays a set when given a `HashSet` and becomes a counting multiset when given a `HashMap`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-06-26T16:53:53Z

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR rewrites only the INTERSECT ALL / EXCEPT ALL execution path, replacing a SipHash-based occurrence-count map with a typed multiset; the SetVariants edits are template-level no-ops for the existing presence-only sets. ClickBench contains no INTERSECT/EXCEPT queries, so neither flagged improvement (Q4 -15.27%, Q15 -20.16%) can be caused by this change. Both deltas are below the trust-override threshold and are best read as run-to-run variance, so both have been downgraded to not_sure. No real per-query regressions or improvements attributable to this PR were found on either benchmark.

clickbench

⚠️ 2 inconclusive

Flagged queries (2 of 43)

	Query	Verdict	Baseline median (ms)	PR median (ms)	Change	q-value	Hint
⚠️	4	not_sure	262	222	-15.3%	<0.0001	PR only changes INTERSECT ALL/EXCEPT ALL counting path; ClickBench has no such queries, so -15% is variance
⚠️	15	not_sure	248	198	-20.2%	<0.0001	Off-path: PR touches only INTERSECT/EXCEPT ALL transform; Q15 doesn't exercise it, -20% is run-to-run variance

_{q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.}

tpch_adapted_1_official

🟢 No significant changes

Debug info

StressHouse run: d6f030af-4702-41ec-88a9-fe1ef61b4c1a
MIRAI run: b27178c7-f3de-4f9d-9837-87ef56012e8c
PR check IDs:
- clickbench_162347_1782491652
- clickbench_162353_1782491652
- clickbench_162363_1782491652
- tpch_adapted_1_official_162371_1782491652
- tpch_adapted_1_official_162385_1782491652
- tpch_adapted_1_official_162397_1782491652

clickhouse-gh · 2026-06-26T17:36:51Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 53/59 (89.83%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 2 line(s) · Uncovered code

Full report · Diff report

Algunenano and others added 5 commits June 16, 2026 17:37

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Jun 16, 2026

clickhouse-gh Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread src/Processors/Transforms/IntersectOrExceptTransform.cpp Outdated

Algunenano and others added 5 commits June 17, 2026 12:31

Merge remote-tracking branch 'blessed/master' into intersect-all-coun…

457411c

…ting-perf

Merge remote-tracking branch 'blessed/master' into intersect-all-coun…

40c58c8

…ting-perf

nickitat self-assigned this Jun 18, 2026

nickitat reviewed Jun 24, 2026

View reviewed changes

Algunenano and others added 2 commits June 26, 2026 13:10

Merge remote-tracking branch 'blessed/master' into intersect-all-coun…

d7f2602

…ting-perf

nickitat approved these changes Jun 26, 2026

View reviewed changes

Comment thread src/Interpreters/SetVariants.h Outdated

Algunenano added this pull request to the merge queue Jun 29, 2026

Merged via the queue into ClickHouse:master with commit af19657 Jun 29, 2026
327 of 330 checks passed

Algunenano deleted the intersect-all-counting-perf branch June 29, 2026 18:11

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 29, 2026

alexey-milovidov mentioned this pull request Jul 2, 2026

Compress hash join right-side blocks in memory before OOM / spilling #107667

Open

1 task

Right operand key	master	this PR
`UInt64`	6.8 s, 2.25 GiB	0.7 s, 1.25 GiB
two `UInt64` columns	6.7 s, 2.25 GiB	1.4 s, 1.88 GiB
`String` (`EXCEPT ALL`)	2.1 s, 483 MiB	0.6 s, 659 MiB

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speed up INTERSECT ALL and EXCEPT ALL#107649

Speed up INTERSECT ALL and EXCEPT ALL#107649
Algunenano merged 13 commits into
ClickHouse:masterfrom
Algunenano:intersect-all-counting-perf

Algunenano commented Jun 16, 2026 •

edited by robot-ch-test-poll4

Loading

Uh oh!

clickhouse-gh Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Algunenano commented Jun 18, 2026

Uh oh!

nickitat left a comment

Uh oh!

Algunenano commented Jun 24, 2026

Uh oh!

Algunenano commented Jun 26, 2026

Uh oh!

nickitat left a comment

Uh oh!

Uh oh!

clickhouse-gh Bot commented Jun 26, 2026

Uh oh!

clickhouse-gh Bot commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

Algunenano commented Jun 16, 2026 • edited by robot-ch-test-poll4 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Version info

Uh oh!

clickhouse-gh Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

Algunenano commented Jun 18, 2026

Uh oh!

nickitat left a comment

Choose a reason for hiding this comment

Uh oh!

Algunenano commented Jun 24, 2026

Uh oh!

Algunenano commented Jun 26, 2026

Uh oh!

nickitat left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

clickhouse-gh Bot commented Jun 26, 2026

clickbench

tpch_adapted_1_official

Uh oh!

clickhouse-gh Bot commented Jun 26, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Algunenano commented Jun 16, 2026 •

edited by robot-ch-test-poll4

Loading

clickhouse-gh Bot commented Jun 16, 2026 •

edited

Loading