Fix aggregate projection not matched when multiple sumIf use different IN sets by Ergus · Pull Request #104765 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix aggregate projection not matched when multiple sumIf use different IN sets#104765

Merged
KochetovNicolai merged 16 commits into
ClickHouse:masterfrom
Ergus:Fix_SE_7671
Jun 2, 2026
Merged

Fix aggregate projection not matched when multiple sumIf use different IN sets#104765
KochetovNicolai merged 16 commits into
ClickHouse:masterfrom
Ergus:Fix_SE_7671

Conversation

@Ergus

@Ergus Ergus commented May 12, 2026

Copy link
Copy Markdown
Member

Problem

An aggregate projection with two or more sumIf(val, col IN (...)) aggregates was broken in two ways:

  • Wrong results: when both aggregates used the same value column, both were silently mapped to the same pre-aggregated projection column, returning incorrect values.
  • Projection unused: when the aggregates used different value columns, the projection was rejected entirely and the query fell back to a full table scan.

The root cause is that ColumnSet::operator[] (the constant argument of every in(col, ...) node) always returns an empty Field{} regardless of content.

The projection matcher therefore could not distinguish IN ('a','b') from IN ('c','d') and mapped all query in nodes to the same projection node.

Fix

Add a content-based, element-order-independent hash to FutureSetFromTuple, computed from the actual set element data at construction time. The projection matcher uses this hash instead of getField() when comparing IN-clause constants, correctly distinguishing different sets while still treating IN ('a','b') and IN ('b','a') as equivalent.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix wrong results or missed projection when an aggregate projection contains multiple sumIf aggregates with different IN (...) conditions.

/fix https://github.com/ClickHouse/support-escalation/issues/7671

Version info

  • Merged into: 26.6.1.321

@clickhouse-gh

clickhouse-gh Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 12, 2026
Comment thread src/Interpreters/PreparedSets.cpp Outdated
Ergus added 2 commits May 12, 2026 23:37
Use getKeyColumns instead of the blocks to avoid duplications.
Comment thread src/Interpreters/PreparedSets.cpp Outdated
Ergus added 2 commits May 13, 2026 00:21
sortBlock resolves sort keys by name, so duplicate names all map to the first matching column in the Block.
That means the hash order is only sorted by the first component, and semantically equivalent sets can still
produce different getContentHash values when rows tie on that component but appear in different input orders.

Partially reverts 4fadf63
@Ergus Ergus requested a review from KochetovNicolai May 13, 2026 00:27
Comment thread src/Interpreters/PreparedSets.cpp Outdated
@Ergus Ergus marked this pull request as ready for review May 13, 2026 11:06
@KochetovNicolai KochetovNicolai self-assigned this May 15, 2026
Comment thread src/Interpreters/PreparedSets.cpp Outdated
Comment on lines +146 to +152
/// Compute a content-based hash. Used by the aggregate projection matcher
/// (actionsDAGUtils.cpp) to distinguish different IN-clause sets.
///
/// We hash the normalized set elements (deduplicated, NULL-filtered, sorted) rather
/// than the raw input block so that permutations and duplicate inputs are equivalent.
/// getPermutation/updatePermutation operate by column index, so sets with repeated
/// element types (e.g. (String, String)) are handled correctly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned that this change can lead to performance degradation, especially for the large sets.

We can do it lazily, and only for relatively small tuples.
I don't really have a good idea of how to improve it further. Was thinking about something like comparing bloom filter states first, which may be cheaper than sorting. But looks like overkill.

@KochetovNicolai KochetovNicolai left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's implement lazy calculation.
And probably some limit on set size.

@Ergus

Ergus commented May 15, 2026

Copy link
Copy Markdown
Member Author

Let's implement lazy calculation. And probably some limit on set size.

Hi @KochetovNicolai

What could be a reasonable default limit for a set size? I won't expect to see a performance degradation unless the set is absurdly large like in the order of 10^4 elements or more, but I could be obviously wrong.

Also, what do you think is expected to happen when the limit is violated? trigger an error? fall back?

@alexey-milovidov

Copy link
Copy Markdown
Member

This was fixed by #105146. Let's update the branch.

@KochetovNicolai

Copy link
Copy Markdown
Member

What could be a reasonable default limit for a set size? I won't expect to see a performance degradation unless the set is absurdly large like in the order of 10^4 elements or more, but I could be obviously wrong.

Agree, maybe 10^3 or 10^4 is ok.

Also, what do you think is expected to happen when the limit is violated? trigger an error? fall back?

I think in our case, we can say that every "big" set is not equal to any other set (we won't apply projection and so on)

Comment thread src/Interpreters/PreparedSets.cpp Outdated
@Ergus Ergus requested a review from KochetovNicolai May 19, 2026 14:18
@KochetovNicolai

Copy link
Copy Markdown
Member

I agree with AI comment and something like use_index_for_in_with_subqueries_max_values is still needed.

Extract `getUniqueKeyColumns` that applies the deduplication filter from `set_key_columns` directly, without touching `set->set_elements`.
`computeContentHash` now uses this helper; `getKeyColumns` is unchanged for its existing callers (`SetsSerialization`, `QueryPlanFormat`).
@Ergus

Ergus commented May 29, 2026

Copy link
Copy Markdown
Member Author

I agree with AI comment and something like use_index_for_in_with_subqueries_max_values is still needed.

The change alread uses use_index_for_in_with_subqueries_max_values

The AI comment was because computeContentHash was calling getKeyColumns, which calls fillSetElementsOnce and permanently materializes set->set_elements, bypassing the large-set guard in buildOrderedSetInplace.

I fixed that in 0521377

if (!normalized.empty() && normalized_rows > 0)
{
EqualRanges ranges{{0, normalized_rows}};
normalized[0]->getPermutation(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computeContentHash sorts with column comparison, but Set stores single numeric keys by their raw fixed-width key (key64 for Float64). Values like 0.0 and -0.0 (and distinct NaN payloads) compare equal for sorting, so the stable permutation keeps their input order; then updateHashFast hashes the raw bytes. As a result, IN (0.0, -0.0) and IN (-0.0, 0.0) can still produce different content hashes even though they contain the same Set keys, which violates the order-independent hash contract and can miss projection matches.

Please add a raw-byte tie-breaker for compare-equal values (or hash from the same key representation used by Set) and cover this with a reversed +0.0/-0.0 or distinct-NaN regression test.

Add query_plan_max_set_size_for_projection_match (default 10000, 0 =
disabled), which bounds the cost of comparing IN-clause sets by content
hash in the aggregate projection matcher. Sets larger than the limit
are treated as non-matching; setting the value to 0 disables the
comparison entirely (compatibility value, matches pre-fix behavior).

matchTrees grows a max_size_for_sets_from_tuple_to_compare parameter so
that callers other than the aggregate projection matcher can opt in
later. The ColumnSet branch now typeid_casts both sides to
FutureSetFromTuple, short-circuits on deduplicated-row-count
inequality, and only computes the O(N log N) content hash when both
sets fit under the limit. Storage and subquery sets are left
non-matching: their content is not known at planning time.

getContentHash is now non-virtual and exists only on
FutureSetFromTuple; the no-op overrides on FutureSetFromStorage and
FutureSetFromSubquery (which forwarded to getHash) are removed.
@KochetovNicolai

Copy link
Copy Markdown
Member

I've still added a separate setting. Looks like use_index_for_in_with_subqueries_max_values uses a bit differen semantic. Also, it's 0 by default (so no limit on set size), and I think is make sense to limit set compariosn by default.

Feel free to revert if you don't like my change.

@KochetovNicolai

Copy link
Copy Markdown
Member

Note:

Can we make FutureSetFromTuple::computeContentHash() implementation without sorting? Is there     
  position-independent hashing?                                                                     

● Yes — a set is unordered by definition, so you don't need to sort to hash it canonically. The
  standard trick is a commutative combination of per-row hashes:

  - Hash each row to a fixed-width value (e.g. SipHash128/CityHash128 over the serialized columns of
   that row).
  - Combine row hashes with a commutative op: addition mod 2^128 (i.e. __uint128_t accumulator).
  This is the textbook "AdHash" multiset hash — preserves multiplicity, is O(N), trivially
  parallelizable (partial sums per thread), and provably as collision-resistant as the underlying
  per-row hash. XOR works too but only as a set (not multiset) hash, because pairs of duplicates
  cancel — usually undesirable.

  Main tradeoff: the combiner is cheap, but the per-row hash dominates cost, so the win over
  sort+hash is only ~`log N. The real wins are (a) no extra memory for sorted copies, (b) 
  streaming/incremental update as rows are inserted, and (c) easy parallelization. If you also want 
  it position-independent across *column order*, you'd hash (column_name, value)` pairs and sum
  inside each row first.

But for this, we need to add something like IColumn::getPositionIndependentHash, so probably later.

/// Sizes are deduplicated counts; different sizes ⇒ different
/// contents, so skip hashing in that case.
if (inner_rows == outer_rows
&& inner_rows <= max_size_for_sets_from_tuple_to_compare)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KochetovNicolai

The only problem with this part of the fix is that tuples with repeated elements could have different sizes, but still be equivalent once deduplicated the repeated elements.

That's why the initial code went directly to the hash and ignored the total size.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch. Will add a test and will remove this check if getTotalRowCount returns count without deduplication (I am not confident it does)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test - looks like duplicates are handled correctly.

Add EXPLAIN cases that exercise the deduplicated-size pre-filter in
matchTrees against four shapes: equal sets despite duplicates in the
query, same dedup size with differing contents, smaller set, and
duplicates that dedup to a different set. Plus a correctness check
that a query with duplicates produces the same result with and
without the projection.

The cases confirm that `Set::getTotalRowCount` (dedup count) and
`getContentHash` (computed on dedup'd elements) agree, so the size
pre-filter cannot cause false positives.
@KochetovNicolai

Copy link
Copy Markdown
Member

PR / Stateless tests (arm_asan_ubsan, targeted) (pull_request) Failing after 55m

● Summary

  This is a CI infrastructure flake, not a real failure caused by the PR.

  What happened:
  - The job is Stateless tests (arm_asan_ubsan, targeted), running ~60 "auto-targeted" optimization/projection tests with --test-runs 50 --jobs 19 --flaky-check.
  - All 2837 attempted test runs passed (job.log:4609 says Failed: 0, Passed: 2837).
  - The wrapper killed the runner with SIGTERM after the ~47 min wall-clock timeout (Timeout exceeded [2817], job.log:4453).
  - The runner exited with -15, which is in ABORTED_RUN_EXIT_CODES, so functional_tests_results.py:238 synthesizes a "Server died" leaf for the report — that's misleading framing for a timeout kill.
  - The actual server process shut down cleanly afterwards: Background threads finished, grpc_shutdown done, Child process exited normally with code 0 (clickhouse-server.log). No <Fatal>, no sanitizer report, no stack trace in
  clickhouse-server.err.log.

  Why it timed out:
  The targeted selector grabbed every test whose name matches optimization/projection/plan (≈60 tests, several slow: 01655_plan_optimizations ~52s, 02317_distinct_in_order_optimization_explain ~60s, 03921_kafka_formats ~38s,
  03031_read_in_order_optimization_with_virtual_row*, ...). Multiplying by --test-runs 50 on ARM ASan+UBSan blew through the 47 min budget. Nothing in the diff makes those tests slower — the targeted selection itself is just too large for
   this config.

  Recommendation: rerun the job. Unrelated to your projection/sumIf change; safe to ignore for the PR's correctness.


@clickhouse-gh

clickhouse-gh Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.40% 84.50% +0.10%
Functions 91.40% 91.40% +0.00%
Branches 77.00% 77.00% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 224/225 (99.56%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@KochetovNicolai KochetovNicolai added this pull request to the merge queue Jun 2, 2026
Merged via the queue into ClickHouse:master with commit 5baed12 Jun 2, 2026
164 of 167 checks passed
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 2, 2026
@clickgapai

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants