iframe-proxy

KochetovNicolai · 2026-06-15T16:09:58Z

Claude

What changes:

isEqualGlobal / getTreeHashGlobal keep the previous semantics exactly — column
sources are compared by content, so the result is canonical across query trees and
processes. Every existing call site keeps this behavior (all QueryTreeNodeWithHash
containers, now spelled ...WithGlobalHash..., and the cross-tree identifiers:
PreparedSets keys, distributed set names, the hash-join cache key, per-shard subquery
deduplication).
isEqualLocal / getTreeHashLocal compare column sources by id: cheap, instance-aware,
and meaningful within a single query tree and its clones. Used by
ValidationUtils::compareGroupByKeys, so a self join's GROUP BY keys are now
distinguished by source id instead of by alias.
A column source internal to the compared subtree (reachable through children — a lambda,
an interpolate, or a subquery used as an expression) is compared by content even in
local mode; an external source (the enclosing query's join tree) is compared by id. This
is positional rather than node-type based, so a GROUP BY key written as a full
expression matches the equal projection expression even when each was resolved
independently into a distinct lambda or subquery instance (most visibly on a distributed
shard that re-parses the GROUP BY expression).
Cloning preserves column source ids by default, so a clone is the same logical source and
compares equal to the original. cloneWithFreshColumnSourceIds is used where a clone is a
genuinely separate instance that may coexist with the original in one tree — CTE expansion
in QueryAnalyzer.

There is no user-visible behavior change: existing sites keep global (previous) semantics,
and switching GROUP BY validation to local id comparison is equivalent to the previous
alias-based comparison for valid queries.

This PR is stacked on the group_by_use_nulls typed-comparison branch (it builds on the
removal of the compare_types option), so it should be reviewed/merged after it.

Changelog category (leave one):

Not for changelog (changelog entry is not required)

Introduce `IColumnSourceNode`, a base class for query tree nodes that can be a column source: `TableNode`, `TableFunctionNode`, `QueryNode`, `UnionNode`, `ArrayJoinNode`, `JoinNode` (for columns from the USING clause), `InterpolateNode`, `LambdaNode`. Each instance gets a `ColumnSourceId` unique within the process, assigned at construction, so a clone is a new source instance with a fresh id. This matters for a self join of a materialized CTE, where the resolved subquery is deep-cloned per reference into the same query tree and the clones must stay distinguishable as column sources. `ColumnNode` keeps a snapshot of its source id next to the source weak pointer. The snapshot is updated by `setColumnSource` and by `cloneAndReplace` (via the new `onWeakPointerRemappedAfterClone` hook) when the source weak pointer is re-pointed to the clone of its previous target or to a node from the replacement map, and it survives the source node itself being destroyed. No behavior change: comparison and hashing do not use the ids yet. This prepares for distinguishing self-join column sources without comparing aliases and for splitting compare/hash into intra-tree and cross-tree variants. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Introduce `IQueryTreeNode::getGlobalTreeHash`: a tree hash that is canonical across query trees and across processes. It keeps the current `getTreeHash` algorithm (column sources referenced from the tree are hashed by content, with repeated visits replaced by visitation-order identifiers), and for now `getTreeHash` forwards to it, so there is no behavior change. Later `getTreeHash` will switch to comparing column source ids directly, which are process-local, and the call sites whose hashes must stay stable across trees, queries or servers are migrated to `getGlobalTreeHash` in advance: - prepared set keys: `CollectSets` and the corresponding lookups in `PlannerActionsVisitor` - distributed set names in `PlannerContext::createSetKey` and the `exists(...)` action node name in `PlannerActionsVisitor` (both can end up in a distributed query header, so they must match between the initiator and shards that build their own query trees) - subquery deduplication key in `buildQueryTreeForShard` - hash-join cache key in `ConcurrentHashJoin` (shared between queries) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replace the single `isEqual`/`getTreeHash` with two explicit variants: - `isEqualGlobal`/`getTreeHashGlobal` keep the previous semantics exactly: column sources are compared by content (the referenced source subtree is traversed, with a correspondence map for repeated references), so the result is canonical across query trees and processes. This is used everywhere the previous code was, including all `QueryTreeNodeWithHash` containers (now spelled `...WithGlobalHash...`) and the cross-tree identifiers (`PreparedSets` keys, distributed set names, the hash-join cache key, per-shard subquery dedup). `getGlobalTreeHash` introduced in the previous commit is renamed to `getTreeHashGlobal` for symmetry. - `isEqualLocal`/`getTreeHashLocal` compare column sources by their column source id, so they are cheap and distinguish distinct source instances (the two sides of a self join), but are meaningful only within one query tree and its clones. `IColumnSourceNode::CompareOptions` no longer needs to express the source distinction through aliases, so the column source weak pointer machinery on `IQueryTreeNode` (the generic `weak_pointers` vector and the two argument constructor) is removed. `ColumnNode` now holds its column source weak pointer and an id snapshot directly; the snapshot is kept in sync by `setColumnSource` and, on cloning, by `remapColumnSourceAfterClone`. Cloning preserves column source ids by default, so a clone is the same logical source and compares equal to the original (needed because, for example, a GROUP BY key is cloned from the projection expression). The new `cloneWithFreshColumnSourceIds` is used where a clone is a genuinely separate instance that may coexist with the original in one tree, i.e. CTE expansion in `QueryAnalyzer`. Local comparison treats a column source that is internal to the compared subtree (reachable through children) by content, and an external source (the enclosing query's join tree) by id. This is positional, not node-type based, so lambdas, interpolate and subqueries used as expressions all match across independently built copies (alpha-equivalence) while self-join table expressions stay distinguished. Without it, a GROUP BY key written as a full expression containing a lambda or subquery would not match the equal projection expression (each resolved into a distinct instance), most visibly on a distributed shard that re-parses the GROUP BY expression. `ValidationUtils::compareGroupByKeys` now uses `isEqualLocal` and the helper `areColumnSourcesEqual` is removed: comparing the whole expression by column source id subsumes the previous two step compare that relied on source aliases. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

clickhouse-gh · 2026-06-15T16:11:31Z

`preCalculateCacheKey` and the `calculateCacheKey(std::shared_ptr<TableJoin> &, IQueryTreeNode::HashState)` overload in `ConcurrentHashJoin` have no callers anywhere. They were the cache key helpers of the logical join step and were left orphaned after `Revert "Lazily apply selector and replication indexes in join"`. This also removes the last use of `getTreeHashGlobal` from `ConcurrentHashJoin`. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

# Conflicts: # src/Analyzer/Passes/OptimizeGroupByFunctionKeysPass.cpp

`getCachedAST` keyed the per-node AST cache on `getTreeHash`, which hashes the constant's value column content on every `toAST` call. The generated AST depends only on the constant value, and the value column is replaced (not mutated) on changes such as `convertToNullable`, so the column pointer identifies it and is enough to invalidate the cache without hashing the (possibly large) column. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Switch comparisons and hash containers that operate within a single query tree from the global to the local variant (`isEqualLocal`/`getTreeHashLocal` and the `...WithLocalHash...` containers). Within one tree these are equivalent to the global variants but cheaper (no traversal of column sources), and they are more correct for self joins: e.g. deduplicating ORDER BY keys or matching GROUP BY keys now distinguishes `a.x` from `b.x` by column source id instead of collapsing them as structurally equal. Converted: the identifier resolver expression/USING/array-join comparisons, and the optimizer passes that compare or deduplicate expressions within one query (logical expression optimizer, group by key elimination, order by / limit by deduplication, grouping functions, fuse functions, etc.). Kept global where comparison is cross-tree or must be instance-agnostic: the self-join join-tree leniency in identifier resolution and `CrossToInnerJoinPass`/`LogicalExpressionOptimizerPass` join matching; the constraint passes (`ComparisonGraph`, `ConstraintsDescription`, `ConvertQueryToCNFPass`) which compare a constraint tree against the query tree; and the cross-tree/cross-server sites (prepared set keys, distributed set names, per-shard subquery dedup, hash-join cache, scalar caches). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A CTE referenced in the join tree (`FROM cte AS a, cte AS b`) resolves to the same shared CTE node, which `resolveQueryJoinTreeNode` then cloned with the id-preserving `clone`. As a result the two references shared column source ids, so `a.x` and `b.x` compared equal under local comparison. With the optimizer passes using local comparison this dropped conditions: e.g. in a self join of a CTE, `a.x <> 0 AND b.x <> 0` deduplicated to a single predicate, and the SEND+MORE=MONEY query (03274) returned 25 rows instead of one because `m.digit <> 0` was removed. Clone CTE references in the join tree with `cloneWithFreshColumnSourceIds`, matching what the expression-context CTE resolution already does. Plain tables are unaffected: each `FROM` entry resolves to its own table node. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

# Conflicts: # src/Analyzer/Resolve/QueryAnalyzer.cpp # src/Analyzer/Resolve/resolveFunction.cpp

scope.nullable_group_by_keys used the global hash map, so with group_by_use_nulls a projection column from one self-join side could match a structurally equal GROUP BY key from the other side and be replaced with its Nullable clone in resolveExpressionNode, before compareGroupByKeys applies local source-id semantics. For example SELECT r.number FROM numbers(2) AS l INNER JOIN numbers(2) AS r ON l.number = r.number GROUP BY l.number WITH ROLLUP let r.number match the l.number key, accepting a column that is not grouped. Make the map (and convertNestedGroupByKeysToNullable / subtreeContainsAnyNode that consume it) use local source-id comparison, so a key only matches a reference to the same source instance. Add a group_by_use_nulls self-join regression test. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

areTreesEqual compared column sources by content using a one-directional record of visited pairs, so a single repeated source could compare equal to several distinct structurally equal sources. For example List(col(s), col(s)) and List(col(s1), col(s2)) compared equal, while computeTreeHash assigns a distinct visitation identifier per source and hashed them differently. Because QueryTreeNodeWithHash::operator== rejects unequal hashes before calling isEqual, such keys were missed in the hash keyed containers. Enforce a bijective source correspondence: a content-compared source is compared once on first sight, every later reference must map to the same counterpart, and a counterpart may be claimed only once (lhs_source_to_rhs maps each lhs source to its rhs counterpart; a set of already-claimed rhs sources rejects the reverse collision). This mirrors computeTreeHash's per-source identifiers and applies to global comparison and to local comparison of sources internal to the compared subtree. Add repeated-source unit tests for both. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The FROM table-identifier resolution path only minted fresh column source ids for CTE and materialized-CTE references, falling back to a plain clone (which preserves ids) otherwise. But the same path also resolves reused table-expression aliases registered by TableExpressionsAliasVisitor, e.g. SELECT r.number FROM (SELECT number FROM numbers(2)) AS l JOIN l AS r ON l.number = r.number GROUP BY l.number WITH ROLLUP Here `r` resolves to the same aliased subquery node as `l`, resolved_to_cte is false, and the preserved id made l.number and r.number share a local source id, so compareGroupByKeys accepted the non-grouped r.number as if it were grouped by l.number. Each join-tree reference is a distinct column source instance regardless of whether it is a CTE, a reused alias, or a plain table, so always clone with fresh column source ids. Extend the regression test with the alias-reuse case. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

# Conflicts: # src/Analyzer/Resolve/IdentifierResolveScope.cpp # src/Storages/StorageDistributed.cpp

clickhouse-gh · 2026-06-20T02:06:47Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	85.20%	85.20%	+0.00%
Functions	92.50%	92.50%	+0.00%
Branches	77.40%	77.40%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 725/763 (95.02%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 15 line(s) · Uncovered code

Full report · Diff report

novikd

I think that code can be simplified, and IColumnSourceNode can be removed completely.

novikd · 2026-06-22T14:33:05Z

    /// If the expression has extra parentheses around it in the original query
    bool parenthesized = false;
+    /// Set by IColumnSourceNode constructor
+    bool is_column_source = false;


Why is it needed?

Agree, it's better to override in the IColumnSourceNode

novikd · 2026-06-22T14:59:00Z

+ColumnNodePtrWithGlobalHashSet QueryNode::getCorrelatedColumnsSet() const
 {
-    ColumnNodePtrWithHashSet result;
+    ColumnNodePtrWithGlobalHashSet result;


I suspect local hash would be enough here, but it could be done in a separate PR

novikd · 2026-06-22T15:05:04Z

+    /** Column source instances are part of the comparison through column source ids, so t2.x and
+      * t1.x are different keys: SELECT t2.x FROM t1 JOIN t1 as t2 ON t1.x = t2.x GROUP BY t1.x;
+      */
+    return node->isEqualLocal(*group_by_key_node, {.compare_aliases = false});


Nice change

novikd · 2026-06-22T15:07:17Z

                                    && constraint_node->getColumnName() == node->getColumnName();
                            else
-                                return constraint_node->isEqual(*node);
+                                return constraint_node->isEqualGlobal(*node);


Also might be local

novikd · 2026-06-22T15:08:06Z

        collectSets(expression, *planner_context);

-        ColumnNodePtrWithHashSet empty_correlated_columns_set;
+        ColumnNodePtrWithGlobalHashSet empty_correlated_columns_set;


Can be local

novikd · 2026-06-22T15:13:55Z

+    /// The generated AST depends only on the constant value. The value column is replaced, not
+    /// mutated, on changes (e.g. `convertToNullable`), so its pointer identifies the value and is
+    /// enough to invalidate the cache, without hashing the (possibly large) column content.
+    hash_state.update(reinterpret_cast<std::uintptr_t>(constant_value.getColumn().get()));


What about the type?

novikd · 2026-06-22T16:20:30Z

+
+    /// Mutable so cloneAndReplace can preserve the original id on a clone (a freshly minted id is
+    /// assigned at construction and overwritten there unless fresh ids were explicitly requested).
+    ColumnSourceId column_source_id;


Why do we need it at all? Pointer can be used as an ID.

That's not exactly the same. By default, IQueryTreeNode::clone keeps ids, so that isEqualLocal works for cloned subtrees on the same tree (I think GROUP BY keys is a good example; they can contain lambdas, which are column sources).

novikd · 2026-06-22T16:22:12Z

+    column_source_id = extractColumnSourceId(source, column);
+}
+
+void ColumnNode::remapColumnSourceAfterClone(const ReplacementMap & old_pointer_to_new_pointer)


This function would not be needed if a pointer was used as an ID.

No, it is not... If the source was cloned, we definitely need to update the reference.

Technically, it's possible to remove if we handle specially ColumnNode in clone.

novikd · 2026-06-22T16:34:39Z

+      * Column references are compared by their column source ids, i.e. two columns are equal only
+      * if they reference the same column source instance. Because of that the result is meaningful
+      * only for trees from the same id space: the same query tree, or its clones (a clone of a
+      * column keeps referencing the original source unless the source was cloned with it). For
+      * trees built independently use isEqualGlobal.
+      *
+      * With default compare options aliases of query tree nodes are compared.
+      * Original ASTs of query tree nodes are not compared.
      */
-    bool isEqual(const IQueryTreeNode & rhs, CompareOptions compare_options = { .compare_aliases = true, .ignore_cte = false }) const;
+    bool isEqualLocal(const IQueryTreeNode & rhs, CompareOptions compare_options = { .compare_aliases = true, .ignore_cte = false }) const;
+
+    /** Is tree equal to other tree with node root, with equality canonical across query trees.
+      *
+      * Unlike isEqualLocal, column references are compared by the content of their column sources,
+      * with repeated references handled through a correspondence map, so structurally equal trees
+      * built independently compare equal. Traverses into column sources, so it is more expensive
+      * than isEqualLocal.
+      */
+    bool isEqualGlobal(const IQueryTreeNode & rhs, CompareOptions compare_options = { .compare_aliases = true, .ignore_cte = false }) const;


I think that it can be improved. We can avoid a deep comparison of source table expressions altogether:

In local mode, the source table expression pointer can be used as ID

In global mode, the source table expression ID can be calculated on the fly

For the global case, there should be 2 functions <op>Global and <op>GlobalImpl.

<op>Global will create a map from table expression ref to ID and then call <op>GlobalImpl.
We can guarantee correctness because we have a deterministic traversal order.

KochetovNicolai and others added 4 commits June 12, 2026 15:46

Merge branch 'master' into group-by-use-nulls-typed-comparison

0ec12be

clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 15, 2026

clickhouse-gh Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread src/Analyzer/Resolve/IdentifierResolveScope.h Outdated

clickhouse-gh Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread src/Analyzer/IQueryTreeNode.cpp Outdated

clickhouse-gh Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp

KochetovNicolai added 3 commits June 16, 2026 11:17

Merge remote-tracking branch 'origin/master' into column-source-ids

31a0662

# Conflicts: # src/Analyzer/Passes/OptimizeGroupByFunctionKeysPass.cpp

Fixing build

128eac5

Fixes after merge

0bb4b2c

KochetovNicolai force-pushed the column-source-ids branch from a45a784 to 0bb4b2c Compare June 16, 2026 12:41

KochetovNicolai and others added 6 commits June 16, 2026 16:58

Merge remote-tracking branch 'origin/master' into column-source-ids

7f294d9

# Conflicts: # src/Analyzer/Resolve/QueryAnalyzer.cpp # src/Analyzer/Resolve/resolveFunction.cpp

clickhouse-gh Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated

KochetovNicolai marked this pull request as ready for review June 18, 2026 11:18

novikd self-assigned this Jun 18, 2026

Merge remote-tracking branch 'origin/master' into column-source-ids

5dd7a93

# Conflicts: # src/Analyzer/Resolve/IdentifierResolveScope.cpp # src/Storages/StorageDistributed.cpp

novikd reviewed Jun 22, 2026

View reviewed changes

groeneai mentioned this pull request Jun 25, 2026

Analyzer: enable_identifier_resolve_cache shares a mutated expression node, breaking compound GROUP BY alias (NOT_FOUND_COLUMN_IN_BLOCK) #108518

Closed

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

KochetovNicolai commented Jun 15, 2026

Changelog category (leave one):

Uh oh!

clickhouse-gh Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented Jun 20, 2026

LLVM Coverage Report

Uh oh!

novikd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

clickhouse-gh Bot commented Jun 15, 2026 •

edited

Loading