Fix aggregate projection not matched when multiple sumIf use different IN sets#104765
Conversation
Include a sorting code to ensure (a, b) hashes equal to (b, a)
Uses the fiddle examples.
Use getKeyColumns instead of the blocks to avoid duplications.
sortBlock resolves sort keys by name, so duplicate names all map to the first matching column in the Block. That means the hash order is only sorted by the first component, and semantically equivalent sets can still produce different getContentHash values when rows tie on that component but appear in different input orders. Partially reverts 4fadf63
| /// Compute a content-based hash. Used by the aggregate projection matcher | ||
| /// (actionsDAGUtils.cpp) to distinguish different IN-clause sets. | ||
| /// | ||
| /// We hash the normalized set elements (deduplicated, NULL-filtered, sorted) rather | ||
| /// than the raw input block so that permutations and duplicate inputs are equivalent. | ||
| /// getPermutation/updatePermutation operate by column index, so sets with repeated | ||
| /// element types (e.g. (String, String)) are handled correctly. |
There was a problem hiding this comment.
I am concerned that this change can lead to performance degradation, especially for the large sets.
We can do it lazily, and only for relatively small tuples.
I don't really have a good idea of how to improve it further. Was thinking about something like comparing bloom filter states first, which may be cheaper than sorting. But looks like overkill.
KochetovNicolai
left a comment
There was a problem hiding this comment.
Let's implement lazy calculation.
And probably some limit on set size.
What could be a reasonable default limit for a set size? I won't expect to see a performance degradation unless the set is absurdly large like in the order of 10^4 elements or more, but I could be obviously wrong. Also, what do you think is expected to happen when the limit is violated? trigger an error? fall back? |
|
This was fixed by #105146. Let's update the branch. |
Agree, maybe 10^3 or 10^4 is ok.
I think in our case, we can say that every "big" set is not equal to any other set (we won't apply projection and so on) |
|
I agree with AI comment and something like |
Extract `getUniqueKeyColumns` that applies the deduplication filter from `set_key_columns` directly, without touching `set->set_elements`. `computeContentHash` now uses this helper; `getKeyColumns` is unchanged for its existing callers (`SetsSerialization`, `QueryPlanFormat`).
The change alread uses The AI comment was because I fixed that in 0521377 |
| if (!normalized.empty() && normalized_rows > 0) | ||
| { | ||
| EqualRanges ranges{{0, normalized_rows}}; | ||
| normalized[0]->getPermutation( |
There was a problem hiding this comment.
computeContentHash sorts with column comparison, but Set stores single numeric keys by their raw fixed-width key (key64 for Float64). Values like 0.0 and -0.0 (and distinct NaN payloads) compare equal for sorting, so the stable permutation keeps their input order; then updateHashFast hashes the raw bytes. As a result, IN (0.0, -0.0) and IN (-0.0, 0.0) can still produce different content hashes even though they contain the same Set keys, which violates the order-independent hash contract and can miss projection matches.
Please add a raw-byte tie-breaker for compare-equal values (or hash from the same key representation used by Set) and cover this with a reversed +0.0/-0.0 or distinct-NaN regression test.
Add query_plan_max_set_size_for_projection_match (default 10000, 0 = disabled), which bounds the cost of comparing IN-clause sets by content hash in the aggregate projection matcher. Sets larger than the limit are treated as non-matching; setting the value to 0 disables the comparison entirely (compatibility value, matches pre-fix behavior). matchTrees grows a max_size_for_sets_from_tuple_to_compare parameter so that callers other than the aggregate projection matcher can opt in later. The ColumnSet branch now typeid_casts both sides to FutureSetFromTuple, short-circuits on deduplicated-row-count inequality, and only computes the O(N log N) content hash when both sets fit under the limit. Storage and subquery sets are left non-matching: their content is not known at planning time. getContentHash is now non-virtual and exists only on FutureSetFromTuple; the no-op overrides on FutureSetFromStorage and FutureSetFromSubquery (which forwarded to getHash) are removed.
|
I've still added a separate setting. Looks like Feel free to revert if you don't like my change. |
|
Note: But for this, we need to add something like |
| /// Sizes are deduplicated counts; different sizes ⇒ different | ||
| /// contents, so skip hashing in that case. | ||
| if (inner_rows == outer_rows | ||
| && inner_rows <= max_size_for_sets_from_tuple_to_compare) |
There was a problem hiding this comment.
The only problem with this part of the fix is that tuples with repeated elements could have different sizes, but still be equivalent once deduplicated the repeated elements.
That's why the initial code went directly to the hash and ignored the total size.
There was a problem hiding this comment.
That's a good catch. Will add a test and will remove this check if getTotalRowCount returns count without deduplication (I am not confident it does)
There was a problem hiding this comment.
I've added a test - looks like duplicates are handled correctly.
Add EXPLAIN cases that exercise the deduplicated-size pre-filter in matchTrees against four shapes: equal sets despite duplicates in the query, same dedup size with differing contents, smaller set, and duplicates that dedup to a different set. Plus a correctness check that a query with duplicates produces the same result with and without the projection. The cases confirm that `Set::getTotalRowCount` (dedup count) and `getContentHash` (computed on dedup'd elements) agree, so the size pre-filter cannot cause false positives.
|
PR / Stateless tests (arm_asan_ubsan, targeted) (pull_request) Failing after 55m |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 224/225 (99.56%) | Lost baseline coverage: none · Uncovered code |
5baed12

Problem
An aggregate projection with two or more
sumIf(val, col IN (...))aggregates was broken in two ways:The root cause is that
ColumnSet::operator[](the constant argument of every in(col, ...) node) always returns an empty Field{} regardless of content.The projection matcher therefore could not distinguish IN ('a','b') from IN ('c','d') and mapped all query in nodes to the same projection node.
Fix
Add a content-based, element-order-independent hash to FutureSetFromTuple, computed from the actual set element data at construction time. The projection matcher uses this hash instead of getField() when comparing IN-clause constants, correctly distinguishing different sets while still treating IN ('a','b') and IN ('b','a') as equivalent.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix wrong results or missed projection when an aggregate projection contains multiple
sumIfaggregates with differentIN (...)conditions./fix https://github.com/ClickHouse/support-escalation/issues/7671
Version info
26.6.1.321