iframe-proxy

Ergus · 2026-05-12T23:09:34Z

Problem

An aggregate projection with two or more sumIf(val, col IN (...)) aggregates was broken in two ways:

Wrong results: when both aggregates used the same value column, both were silently mapped to the same pre-aggregated projection column, returning incorrect values.
Projection unused: when the aggregates used different value columns, the projection was rejected entirely and the query fell back to a full table scan.

The root cause is that ColumnSet::operator[] (the constant argument of every in(col, ...) node) always returns an empty Field{} regardless of content.

The projection matcher therefore could not distinguish IN ('a','b') from IN ('c','d') and mapped all query in nodes to the same projection node.

Fix

Add a content-based, element-order-independent hash to FutureSetFromTuple, computed from the actual set element data at construction time. The projection matcher uses this hash instead of getField() when comparing IN-clause constants, correctly distinguishing different sets while still treating IN ('a','b') and IN ('b','a') as equivalent.

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix wrong results or missed projection when an aggregate projection contains multiple sumIf aggregates with different IN (...) conditions.

/fix https://github.com/ClickHouse/support-escalation/issues/7671

Version info

Merged into: 26.6.1.321

Include a sorting code to ensure (a, b) hashes equal to (b, a)

Uses the fiddle examples.

clickhouse-gh · 2026-05-12T23:10:25Z

Use getKeyColumns instead of the blocks to avoid duplications.

sortBlock resolves sort keys by name, so duplicate names all map to the first matching column in the Block. That means the hash order is only sorted by the first component, and semantically equivalent sets can still produce different getContentHash values when rows tie on that component but appear in different input orders. Partially reverts 4fadf63

KochetovNicolai · 2026-05-15T15:51:19Z

+    /// Compute a content-based hash. Used by the aggregate projection matcher
+    /// (actionsDAGUtils.cpp) to distinguish different IN-clause sets.
+    ///
+    /// We hash the normalized set elements (deduplicated, NULL-filtered, sorted) rather
+    /// than the raw input block so that permutations and duplicate inputs are equivalent.
+    /// getPermutation/updatePermutation operate by column index, so sets with repeated
+    /// element types (e.g. (String, String)) are handled correctly.


I am concerned that this change can lead to performance degradation, especially for the large sets.

We can do it lazily, and only for relatively small tuples.
I don't really have a good idea of how to improve it further. Was thinking about something like comparing bloom filter states first, which may be cheaper than sorting. But looks like overkill.

KochetovNicolai

Let's implement lazy calculation.
And probably some limit on set size.

Ergus · 2026-05-15T16:38:20Z

Let's implement lazy calculation. And probably some limit on set size.

Hi @KochetovNicolai

What could be a reasonable default limit for a set size? I won't expect to see a performance degradation unless the set is absurdly large like in the order of 10^4 elements or more, but I could be obviously wrong.

Also, what do you think is expected to happen when the limit is violated? trigger an error? fall back?

alexey-milovidov · 2026-05-17T18:42:03Z

This was fixed by #105146. Let's update the branch.

KochetovNicolai · 2026-05-18T13:02:44Z

What could be a reasonable default limit for a set size? I won't expect to see a performance degradation unless the set is absurdly large like in the order of 10^4 elements or more, but I could be obviously wrong.

Agree, maybe 10^3 or 10^4 is ok.

Also, what do you think is expected to happen when the limit is violated? trigger an error? fall back?

I think in our case, we can say that every "big" set is not equal to any other set (we won't apply projection and so on)

KochetovNicolai · 2026-05-29T12:40:34Z

I agree with AI comment and something like use_index_for_in_with_subqueries_max_values is still needed.

Extract `getUniqueKeyColumns` that applies the deduplication filter from `set_key_columns` directly, without touching `set->set_elements`. `computeContentHash` now uses this helper; `getKeyColumns` is unchanged for its existing callers (`SetsSerialization`, `QueryPlanFormat`).

Ergus · 2026-05-29T16:38:18Z

I agree with AI comment and something like use_index_for_in_with_subqueries_max_values is still needed.

The change alread uses use_index_for_in_with_subqueries_max_values

The AI comment was because computeContentHash was calling getKeyColumns, which calls fillSetElementsOnce and permanently materializes set->set_elements, bypassing the large-set guard in buildOrderedSetInplace.

I fixed that in 0521377

clickhouse-gh · 2026-05-29T16:46:32Z

+    if (!normalized.empty() && normalized_rows > 0)
+    {
+        EqualRanges ranges{{0, normalized_rows}};
+        normalized[0]->getPermutation(


computeContentHash sorts with column comparison, but Set stores single numeric keys by their raw fixed-width key (key64 for Float64). Values like 0.0 and -0.0 (and distinct NaN payloads) compare equal for sorting, so the stable permutation keeps their input order; then updateHashFast hashes the raw bytes. As a result, IN (0.0, -0.0) and IN (-0.0, 0.0) can still produce different content hashes even though they contain the same Set keys, which violates the order-independent hash contract and can miss projection matches.

Please add a raw-byte tie-breaker for compare-equal values (or hash from the same key representation used by Set) and cover this with a reversed +0.0/-0.0 or distinct-NaN regression test.

Add query_plan_max_set_size_for_projection_match (default 10000, 0 = disabled), which bounds the cost of comparing IN-clause sets by content hash in the aggregate projection matcher. Sets larger than the limit are treated as non-matching; setting the value to 0 disables the comparison entirely (compatibility value, matches pre-fix behavior). matchTrees grows a max_size_for_sets_from_tuple_to_compare parameter so that callers other than the aggregate projection matcher can opt in later. The ColumnSet branch now typeid_casts both sides to FutureSetFromTuple, short-circuits on deduplicated-row-count inequality, and only computes the O(N log N) content hash when both sets fit under the limit. Storage and subquery sets are left non-matching: their content is not known at planning time. getContentHash is now non-virtual and exists only on FutureSetFromTuple; the no-op overrides on FutureSetFromStorage and FutureSetFromSubquery (which forwarded to getHash) are removed.

KochetovNicolai · 2026-05-29T17:35:42Z

I've still added a separate setting. Looks like use_index_for_in_with_subqueries_max_values uses a bit differen semantic. Also, it's 0 by default (so no limit on set size), and I think is make sense to limit set compariosn by default.

Feel free to revert if you don't like my change.

KochetovNicolai · 2026-05-29T17:46:09Z

Note:

Can we make FutureSetFromTuple::computeContentHash() implementation without sorting? Is there     
  position-independent hashing?                                                                     

● Yes — a set is unordered by definition, so you don't need to sort to hash it canonically. The
  standard trick is a commutative combination of per-row hashes:

  - Hash each row to a fixed-width value (e.g. SipHash128/CityHash128 over the serialized columns of
   that row).
  - Combine row hashes with a commutative op: addition mod 2^128 (i.e. __uint128_t accumulator).
  This is the textbook "AdHash" multiset hash — preserves multiplicity, is O(N), trivially
  parallelizable (partial sums per thread), and provably as collision-resistant as the underlying
  per-row hash. XOR works too but only as a set (not multiset) hash, because pairs of duplicates
  cancel — usually undesirable.

  Main tradeoff: the combiner is cheap, but the per-row hash dominates cost, so the win over
  sort+hash is only ~`log N. The real wins are (a) no extra memory for sorted copies, (b) 
  streaming/incremental update as rows are inserted, and (c) easy parallelization. If you also want 
  it position-independent across *column order*, you'd hash (column_name, value)` pairs and sum
  inside each row first.

But for this, we need to add something like IColumn::getPositionIndependentHash, so probably later.

Ergus · 2026-06-01T11:10:54Z

+                                                        /// Sizes are deduplicated counts; different sizes ⇒ different
+                                                        /// contents, so skip hashing in that case.
+                                                        if (inner_rows == outer_rows
+                                                            && inner_rows <= max_size_for_sets_from_tuple_to_compare)


@KochetovNicolai

The only problem with this part of the fix is that tuples with repeated elements could have different sizes, but still be equivalent once deduplicated the repeated elements.

That's why the initial code went directly to the hash and ignored the total size.

That's a good catch. Will add a test and will remove this check if getTotalRowCount returns count without deduplication (I am not confident it does)

I've added a test - looks like duplicates are handled correctly.

Add EXPLAIN cases that exercise the deduplicated-size pre-filter in matchTrees against four shapes: equal sets despite duplicates in the query, same dedup size with differing contents, smaller set, and duplicates that dedup to a different set. Plus a correctness check that a query with duplicates produces the same result with and without the projection. The cases confirm that `Set::getTotalRowCount` (dedup count) and `getContentHash` (computed on dedup'd elements) agree, so the size pre-filter cannot cause false positives.

KochetovNicolai · 2026-06-01T17:15:30Z

PR / Stateless tests (arm_asan_ubsan, targeted) (pull_request) Failing after 55m

● Summary

  This is a CI infrastructure flake, not a real failure caused by the PR.

  What happened:
  - The job is Stateless tests (arm_asan_ubsan, targeted), running ~60 "auto-targeted" optimization/projection tests with --test-runs 50 --jobs 19 --flaky-check.
  - All 2837 attempted test runs passed (job.log:4609 says Failed: 0, Passed: 2837).
  - The wrapper killed the runner with SIGTERM after the ~47 min wall-clock timeout (Timeout exceeded [2817], job.log:4453).
  - The runner exited with -15, which is in ABORTED_RUN_EXIT_CODES, so functional_tests_results.py:238 synthesizes a "Server died" leaf for the report — that's misleading framing for a timeout kill.
  - The actual server process shut down cleanly afterwards: Background threads finished, grpc_shutdown done, Child process exited normally with code 0 (clickhouse-server.log). No <Fatal>, no sanitizer report, no stack trace in
  clickhouse-server.err.log.

  Why it timed out:
  The targeted selector grabbed every test whose name matches optimization/projection/plan (≈60 tests, several slow: 01655_plan_optimizations ~52s, 02317_distinct_in_order_optimization_explain ~60s, 03921_kafka_formats ~38s,
  03031_read_in_order_optimization_with_virtual_row*, ...). Multiplying by --test-runs 50 on ARM ASan+UBSan blew through the 47 min budget. Nothing in the diff makes those tests slower — the targeted selection itself is just too large for
   this config.

  Recommendation: rerun the job. Unrelated to your projection/sumIf change; safe to ignore for the PR's correctness.

clickhouse-gh · 2026-06-01T17:25:45Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.40%	84.50%	+0.10%
Functions	91.40%	91.40%	+0.00%
Branches	77.00%	77.00%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 224/225 (99.56%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

clickgapai · 2026-06-03T00:40:27Z

Ergus added 4 commits May 12, 2026 22:54

Add hashing function to PreparedSets

f26bf54

Include a sorting code to ensure (a, b) hashes equal to (b, a)

Add test for the new has function

3ecc376

Use the new function to compare in matchTrees

89326bb

Add test for the full fix

fa8b4d4

Uses the fiddle examples.

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 12, 2026

clickhouse-gh Bot reviewed May 12, 2026

View reviewed changes

Comment thread src/Interpreters/PreparedSets.cpp Outdated

Ergus added 2 commits May 12, 2026 23:37

Fix IN input duplications.

b16ce18

Use getKeyColumns instead of the blocks to avoid duplications.

Simplify sorting code

4fadf63

clickhouse-gh Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/Interpreters/PreparedSets.cpp Outdated

Ergus added 2 commits May 13, 2026 00:21

Add failing test with ties

ea83105

Ergus requested a review from KochetovNicolai May 13, 2026 00:27

clickhouse-gh Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/Interpreters/PreparedSets.cpp Outdated

Ergus marked this pull request as ready for review May 13, 2026 11:06

KochetovNicolai self-assigned this May 15, 2026

KochetovNicolai reviewed May 15, 2026

View reviewed changes

KochetovNicolai requested changes May 15, 2026

View reviewed changes

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

Merge branch 'master' into Fix_SE_7671

155272a

Implement lazy ONCE hash computation

73f5a11

clickhouse-gh Bot reviewed May 18, 2026

View reviewed changes

Comment thread src/Interpreters/PreparedSets.cpp Outdated

Avoid side effect in third function

66a2f40

Ergus requested a review from KochetovNicolai May 19, 2026 14:18

clickhouse-gh Bot reviewed May 29, 2026

View reviewed changes

KochetovNicolai added 2 commits May 29, 2026 17:32

Merge branch 'master' into Fix_SE_7671-2

566b53d

KochetovNicolai approved these changes May 29, 2026

View reviewed changes

Merge branch 'master' into Fix_SE_7671

a14510c

Ergus commented Jun 1, 2026

View reviewed changes

KochetovNicolai added this pull request to the merge queue Jun 2, 2026

Merged via the queue into ClickHouse:master with commit 5baed12 Jun 2, 2026
164 of 167 checks passed

robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 2, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

Ergus commented May 12, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Version info

Uh oh!

clickhouse-gh Bot commented May 12, 2026 • edited by KochetovNicolai Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Findings

Tests

Final Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KochetovNicolai May 15, 2026

Choose a reason for hiding this comment

Uh oh!

KochetovNicolai left a comment

Choose a reason for hiding this comment

Uh oh!

Ergus commented May 15, 2026

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

KochetovNicolai commented May 18, 2026

Uh oh!

Uh oh!

KochetovNicolai commented May 29, 2026

Uh oh!

Ergus commented May 29, 2026

Uh oh!

clickhouse-gh Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

KochetovNicolai commented May 29, 2026

Uh oh!

KochetovNicolai commented May 29, 2026

Uh oh!

Ergus Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

KochetovNicolai Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

KochetovNicolai Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

KochetovNicolai commented Jun 1, 2026

Uh oh!

clickhouse-gh Bot commented Jun 1, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

clickgapai commented Jun 3, 2026

TL;DR

Reproduction

Bisect

Affects

Suggested next step

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Ergus commented May 12, 2026 •

edited by robot-clickhouse

Loading

clickhouse-gh Bot commented May 12, 2026 •

edited by KochetovNicolai

Loading