Introduce column source ids and split query tree comparison into local and global#107535
Introduce column source ids and split query tree comparison into local and global#107535KochetovNicolai wants to merge 16 commits into
Conversation
Introduce `IColumnSourceNode`, a base class for query tree nodes that can be a column source: `TableNode`, `TableFunctionNode`, `QueryNode`, `UnionNode`, `ArrayJoinNode`, `JoinNode` (for columns from the USING clause), `InterpolateNode`, `LambdaNode`. Each instance gets a `ColumnSourceId` unique within the process, assigned at construction, so a clone is a new source instance with a fresh id. This matters for a self join of a materialized CTE, where the resolved subquery is deep-cloned per reference into the same query tree and the clones must stay distinguishable as column sources. `ColumnNode` keeps a snapshot of its source id next to the source weak pointer. The snapshot is updated by `setColumnSource` and by `cloneAndReplace` (via the new `onWeakPointerRemappedAfterClone` hook) when the source weak pointer is re-pointed to the clone of its previous target or to a node from the replacement map, and it survives the source node itself being destroyed. No behavior change: comparison and hashing do not use the ids yet. This prepares for distinguishing self-join column sources without comparing aliases and for splitting compare/hash into intra-tree and cross-tree variants. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Introduce `IQueryTreeNode::getGlobalTreeHash`: a tree hash that is canonical across query trees and across processes. It keeps the current `getTreeHash` algorithm (column sources referenced from the tree are hashed by content, with repeated visits replaced by visitation-order identifiers), and for now `getTreeHash` forwards to it, so there is no behavior change. Later `getTreeHash` will switch to comparing column source ids directly, which are process-local, and the call sites whose hashes must stay stable across trees, queries or servers are migrated to `getGlobalTreeHash` in advance: - prepared set keys: `CollectSets` and the corresponding lookups in `PlannerActionsVisitor` - distributed set names in `PlannerContext::createSetKey` and the `exists(...)` action node name in `PlannerActionsVisitor` (both can end up in a distributed query header, so they must match between the initiator and shards that build their own query trees) - subquery deduplication key in `buildQueryTreeForShard` - hash-join cache key in `ConcurrentHashJoin` (shared between queries) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the single `isEqual`/`getTreeHash` with two explicit variants: - `isEqualGlobal`/`getTreeHashGlobal` keep the previous semantics exactly: column sources are compared by content (the referenced source subtree is traversed, with a correspondence map for repeated references), so the result is canonical across query trees and processes. This is used everywhere the previous code was, including all `QueryTreeNodeWithHash` containers (now spelled `...WithGlobalHash...`) and the cross-tree identifiers (`PreparedSets` keys, distributed set names, the hash-join cache key, per-shard subquery dedup). `getGlobalTreeHash` introduced in the previous commit is renamed to `getTreeHashGlobal` for symmetry. - `isEqualLocal`/`getTreeHashLocal` compare column sources by their column source id, so they are cheap and distinguish distinct source instances (the two sides of a self join), but are meaningful only within one query tree and its clones. `IColumnSourceNode::CompareOptions` no longer needs to express the source distinction through aliases, so the column source weak pointer machinery on `IQueryTreeNode` (the generic `weak_pointers` vector and the two argument constructor) is removed. `ColumnNode` now holds its column source weak pointer and an id snapshot directly; the snapshot is kept in sync by `setColumnSource` and, on cloning, by `remapColumnSourceAfterClone`. Cloning preserves column source ids by default, so a clone is the same logical source and compares equal to the original (needed because, for example, a GROUP BY key is cloned from the projection expression). The new `cloneWithFreshColumnSourceIds` is used where a clone is a genuinely separate instance that may coexist with the original in one tree, i.e. CTE expansion in `QueryAnalyzer`. Local comparison treats a column source that is internal to the compared subtree (reachable through children) by content, and an external source (the enclosing query's join tree) by id. This is positional, not node-type based, so lambdas, interpolate and subqueries used as expressions all match across independently built copies (alpha-equivalence) while self-join table expressions stay distinguished. Without it, a GROUP BY key written as a full expression containing a lambda or subquery would not match the equal projection expression (each resolved into a distinct instance), most visibly on a distributed shard that re-parses the GROUP BY expression. `ValidationUtils::compareGroupByKeys` now uses `isEqualLocal` and the helper `areColumnSourcesEqual` is removed: comparing the whole expression by column source id subsumes the previous two step compare that relied on source aliases. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`preCalculateCacheKey` and the `calculateCacheKey(std::shared_ptr<TableJoin> &, IQueryTreeNode::HashState)` overload in `ConcurrentHashJoin` have no callers anywhere. They were the cache key helpers of the logical join step and were left orphaned after `Revert "Lazily apply selector and replication indexes in join"`. This also removes the last use of `getTreeHashGlobal` from `ConcurrentHashJoin`. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts: # src/Analyzer/Passes/OptimizeGroupByFunctionKeysPass.cpp
a45a784 to
0bb4b2c
Compare
`getCachedAST` keyed the per-node AST cache on `getTreeHash`, which hashes the constant's value column content on every `toAST` call. The generated AST depends only on the constant value, and the value column is replaced (not mutated) on changes such as `convertToNullable`, so the column pointer identifies it and is enough to invalidate the cache without hashing the (possibly large) column. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Switch comparisons and hash containers that operate within a single query tree from the global to the local variant (`isEqualLocal`/`getTreeHashLocal` and the `...WithLocalHash...` containers). Within one tree these are equivalent to the global variants but cheaper (no traversal of column sources), and they are more correct for self joins: e.g. deduplicating ORDER BY keys or matching GROUP BY keys now distinguishes `a.x` from `b.x` by column source id instead of collapsing them as structurally equal. Converted: the identifier resolver expression/USING/array-join comparisons, and the optimizer passes that compare or deduplicate expressions within one query (logical expression optimizer, group by key elimination, order by / limit by deduplication, grouping functions, fuse functions, etc.). Kept global where comparison is cross-tree or must be instance-agnostic: the self-join join-tree leniency in identifier resolution and `CrossToInnerJoinPass`/`LogicalExpressionOptimizerPass` join matching; the constraint passes (`ComparisonGraph`, `ConstraintsDescription`, `ConvertQueryToCNFPass`) which compare a constraint tree against the query tree; and the cross-tree/cross-server sites (prepared set keys, distributed set names, per-shard subquery dedup, hash-join cache, scalar caches). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A CTE referenced in the join tree (`FROM cte AS a, cte AS b`) resolves to the same shared CTE node, which `resolveQueryJoinTreeNode` then cloned with the id-preserving `clone`. As a result the two references shared column source ids, so `a.x` and `b.x` compared equal under local comparison. With the optimizer passes using local comparison this dropped conditions: e.g. in a self join of a CTE, `a.x <> 0 AND b.x <> 0` deduplicated to a single predicate, and the SEND+MORE=MONEY query (03274) returned 25 rows instead of one because `m.digit <> 0` was removed. Clone CTE references in the join tree with `cloneWithFreshColumnSourceIds`, matching what the expression-context CTE resolution already does. Plain tables are unaffected: each `FROM` entry resolves to its own table node. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts: # src/Analyzer/Resolve/QueryAnalyzer.cpp # src/Analyzer/Resolve/resolveFunction.cpp
scope.nullable_group_by_keys used the global hash map, so with group_by_use_nulls a projection column from one self-join side could match a structurally equal GROUP BY key from the other side and be replaced with its Nullable clone in resolveExpressionNode, before compareGroupByKeys applies local source-id semantics. For example SELECT r.number FROM numbers(2) AS l INNER JOIN numbers(2) AS r ON l.number = r.number GROUP BY l.number WITH ROLLUP let r.number match the l.number key, accepting a column that is not grouped. Make the map (and convertNestedGroupByKeysToNullable / subtreeContainsAnyNode that consume it) use local source-id comparison, so a key only matches a reference to the same source instance. Add a group_by_use_nulls self-join regression test. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
areTreesEqual compared column sources by content using a one-directional record of visited pairs, so a single repeated source could compare equal to several distinct structurally equal sources. For example List(col(s), col(s)) and List(col(s1), col(s2)) compared equal, while computeTreeHash assigns a distinct visitation identifier per source and hashed them differently. Because QueryTreeNodeWithHash::operator== rejects unequal hashes before calling isEqual, such keys were missed in the hash keyed containers. Enforce a bijective source correspondence: a content-compared source is compared once on first sight, every later reference must map to the same counterpart, and a counterpart may be claimed only once (lhs_source_to_rhs maps each lhs source to its rhs counterpart; a set of already-claimed rhs sources rejects the reverse collision). This mirrors computeTreeHash's per-source identifiers and applies to global comparison and to local comparison of sources internal to the compared subtree. Add repeated-source unit tests for both. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The FROM table-identifier resolution path only minted fresh column source ids for CTE and materialized-CTE references, falling back to a plain clone (which preserves ids) otherwise. But the same path also resolves reused table-expression aliases registered by TableExpressionsAliasVisitor, e.g. SELECT r.number FROM (SELECT number FROM numbers(2)) AS l JOIN l AS r ON l.number = r.number GROUP BY l.number WITH ROLLUP Here `r` resolves to the same aliased subquery node as `l`, resolved_to_cte is false, and the preserved id made l.number and r.number share a local source id, so compareGroupByKeys accepted the non-grouped r.number as if it were grouped by l.number. Each join-tree reference is a distinct column source instance regardless of whether it is a CTE, a reused alias, or a plain table, so always clone with fresh column source ids. Extend the regression test with the alias-reuse case. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts: # src/Analyzer/Resolve/IdentifierResolveScope.cpp # src/Storages/StorageDistributed.cpp
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 725/763 (95.02%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 15 line(s) · Uncovered code |
novikd
left a comment
There was a problem hiding this comment.
I think that code can be simplified, and IColumnSourceNode can be removed completely.
| /// If the expression has extra parentheses around it in the original query | ||
| bool parenthesized = false; | ||
| /// Set by IColumnSourceNode constructor | ||
| bool is_column_source = false; |
There was a problem hiding this comment.
Agree, it's better to override in the IColumnSourceNode
| ColumnNodePtrWithGlobalHashSet QueryNode::getCorrelatedColumnsSet() const | ||
| { | ||
| ColumnNodePtrWithHashSet result; | ||
| ColumnNodePtrWithGlobalHashSet result; |
There was a problem hiding this comment.
I suspect local hash would be enough here, but it could be done in a separate PR
| /** Column source instances are part of the comparison through column source ids, so t2.x and | ||
| * t1.x are different keys: SELECT t2.x FROM t1 JOIN t1 as t2 ON t1.x = t2.x GROUP BY t1.x; | ||
| */ | ||
| return node->isEqualLocal(*group_by_key_node, {.compare_aliases = false}); |
| && constraint_node->getColumnName() == node->getColumnName(); | ||
| else | ||
| return constraint_node->isEqual(*node); | ||
| return constraint_node->isEqualGlobal(*node); |
| collectSets(expression, *planner_context); | ||
|
|
||
| ColumnNodePtrWithHashSet empty_correlated_columns_set; | ||
| ColumnNodePtrWithGlobalHashSet empty_correlated_columns_set; |
| /// The generated AST depends only on the constant value. The value column is replaced, not | ||
| /// mutated, on changes (e.g. `convertToNullable`), so its pointer identifies the value and is | ||
| /// enough to invalidate the cache, without hashing the (possibly large) column content. | ||
| hash_state.update(reinterpret_cast<std::uintptr_t>(constant_value.getColumn().get())); |
|
|
||
| /// Mutable so cloneAndReplace can preserve the original id on a clone (a freshly minted id is | ||
| /// assigned at construction and overwritten there unless fresh ids were explicitly requested). | ||
| ColumnSourceId column_source_id; |
There was a problem hiding this comment.
Why do we need it at all? Pointer can be used as an ID.
There was a problem hiding this comment.
That's not exactly the same. By default, IQueryTreeNode::clone keeps ids, so that isEqualLocal works for cloned subtrees on the same tree (I think GROUP BY keys is a good example; they can contain lambdas, which are column sources).
| column_source_id = extractColumnSourceId(source, column); | ||
| } | ||
|
|
||
| void ColumnNode::remapColumnSourceAfterClone(const ReplacementMap & old_pointer_to_new_pointer) |
There was a problem hiding this comment.
This function would not be needed if a pointer was used as an ID.
There was a problem hiding this comment.
No, it is not... If the source was cloned, we definitely need to update the reference.
Technically, it's possible to remove if we handle specially ColumnNode in clone.
There was a problem hiding this comment.
I think that it can be improved. We can avoid a deep comparison of source table expressions altogether:
- In local mode, the source table expression pointer can be used as ID
- In global mode, the source table expression ID can be calculated on the fly
For the global case, there should be 2 functions <op>Global and <op>GlobalImpl.
<op>Global will create a map from table expression ref to ID and then call <op>GlobalImpl.
We can guarantee correctness because we have a deterministic traversal order.

Claude
What changes:
isEqualGlobal/getTreeHashGlobalkeep the previous semantics exactly — columnsources are compared by content, so the result is canonical across query trees and
processes. Every existing call site keeps this behavior (all
QueryTreeNodeWithHashcontainers, now spelled
...WithGlobalHash..., and the cross-tree identifiers:PreparedSetskeys, distributed set names, the hash-join cache key, per-shard subquerydeduplication).
isEqualLocal/getTreeHashLocalcompare column sources by id: cheap, instance-aware,and meaningful within a single query tree and its clones. Used by
ValidationUtils::compareGroupByKeys, so a self join'sGROUP BYkeys are nowdistinguished by source id instead of by alias.
an interpolate, or a subquery used as an expression) is compared by content even in
local mode; an external source (the enclosing query's join tree) is compared by id. This
is positional rather than node-type based, so a
GROUP BYkey written as a fullexpression matches the equal projection expression even when each was resolved
independently into a distinct lambda or subquery instance (most visibly on a distributed
shard that re-parses the
GROUP BYexpression).compares equal to the original.
cloneWithFreshColumnSourceIdsis used where a clone is agenuinely separate instance that may coexist with the original in one tree — CTE expansion
in
QueryAnalyzer.There is no user-visible behavior change: existing sites keep global (previous) semantics,
and switching
GROUP BYvalidation to local id comparison is equivalent to the previousalias-based comparison for valid queries.
This PR is stacked on the
group_by_use_nullstyped-comparison branch (it builds on theremoval of the
compare_typesoption), so it should be reviewed/merged after it.Changelog category (leave one):