Introduce column source ids and split query tree comparison into local and global by KochetovNicolai · Pull Request #107535 · ClickHouse/ClickHouse · GitHub
Skip to content

Introduce column source ids and split query tree comparison into local and global#107535

Open
KochetovNicolai wants to merge 16 commits into
masterfrom
column-source-ids
Open

Introduce column source ids and split query tree comparison into local and global#107535
KochetovNicolai wants to merge 16 commits into
masterfrom
column-source-ids

Conversation

@KochetovNicolai

Copy link
Copy Markdown
Member
Claude

What changes:

  • isEqualGlobal / getTreeHashGlobal keep the previous semantics exactly — column
    sources are compared by content, so the result is canonical across query trees and
    processes. Every existing call site keeps this behavior (all QueryTreeNodeWithHash
    containers, now spelled ...WithGlobalHash..., and the cross-tree identifiers:
    PreparedSets keys, distributed set names, the hash-join cache key, per-shard subquery
    deduplication).
  • isEqualLocal / getTreeHashLocal compare column sources by id: cheap, instance-aware,
    and meaningful within a single query tree and its clones. Used by
    ValidationUtils::compareGroupByKeys, so a self join's GROUP BY keys are now
    distinguished by source id instead of by alias.
  • A column source internal to the compared subtree (reachable through children — a lambda,
    an interpolate, or a subquery used as an expression) is compared by content even in
    local mode; an external source (the enclosing query's join tree) is compared by id. This
    is positional rather than node-type based, so a GROUP BY key written as a full
    expression matches the equal projection expression even when each was resolved
    independently into a distinct lambda or subquery instance (most visibly on a distributed
    shard that re-parses the GROUP BY expression).
  • Cloning preserves column source ids by default, so a clone is the same logical source and
    compares equal to the original. cloneWithFreshColumnSourceIds is used where a clone is a
    genuinely separate instance that may coexist with the original in one tree — CTE expansion
    in QueryAnalyzer.

There is no user-visible behavior change: existing sites keep global (previous) semantics,
and switching GROUP BY validation to local id comparison is equivalent to the previous
alias-based comparison for valid queries.

This PR is stacked on the group_by_use_nulls typed-comparison branch (it builds on the
removal of the compare_types option), so it should be reviewed/merged after it.

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

KochetovNicolai and others added 4 commits June 12, 2026 15:46
Introduce `IColumnSourceNode`, a base class for query tree nodes that can
be a column source: `TableNode`, `TableFunctionNode`, `QueryNode`,
`UnionNode`, `ArrayJoinNode`, `JoinNode` (for columns from the USING
clause), `InterpolateNode`, `LambdaNode`. Each instance gets a
`ColumnSourceId` unique within the process, assigned at construction, so
a clone is a new source instance with a fresh id. This matters for a self
join of a materialized CTE, where the resolved subquery is deep-cloned
per reference into the same query tree and the clones must stay
distinguishable as column sources.

`ColumnNode` keeps a snapshot of its source id next to the source weak
pointer. The snapshot is updated by `setColumnSource` and by
`cloneAndReplace` (via the new `onWeakPointerRemappedAfterClone` hook)
when the source weak pointer is re-pointed to the clone of its previous
target or to a node from the replacement map, and it survives the source
node itself being destroyed.

No behavior change: comparison and hashing do not use the ids yet. This
prepares for distinguishing self-join column sources without comparing
aliases and for splitting compare/hash into intra-tree and cross-tree
variants.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Introduce `IQueryTreeNode::getGlobalTreeHash`: a tree hash that is
canonical across query trees and across processes. It keeps the current
`getTreeHash` algorithm (column sources referenced from the tree are
hashed by content, with repeated visits replaced by visitation-order
identifiers), and for now `getTreeHash` forwards to it, so there is no
behavior change. Later `getTreeHash` will switch to comparing column
source ids directly, which are process-local, and the call sites whose
hashes must stay stable across trees, queries or servers are migrated to
`getGlobalTreeHash` in advance:

- prepared set keys: `CollectSets` and the corresponding lookups in
  `PlannerActionsVisitor`
- distributed set names in `PlannerContext::createSetKey` and the
  `exists(...)` action node name in `PlannerActionsVisitor` (both can end
  up in a distributed query header, so they must match between the
  initiator and shards that build their own query trees)
- subquery deduplication key in `buildQueryTreeForShard`
- hash-join cache key in `ConcurrentHashJoin` (shared between queries)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the single `isEqual`/`getTreeHash` with two explicit variants:

- `isEqualGlobal`/`getTreeHashGlobal` keep the previous semantics exactly:
  column sources are compared by content (the referenced source subtree is
  traversed, with a correspondence map for repeated references), so the
  result is canonical across query trees and processes. This is used
  everywhere the previous code was, including all `QueryTreeNodeWithHash`
  containers (now spelled `...WithGlobalHash...`) and the cross-tree
  identifiers (`PreparedSets` keys, distributed set names, the hash-join
  cache key, per-shard subquery dedup). `getGlobalTreeHash` introduced in
  the previous commit is renamed to `getTreeHashGlobal` for symmetry.

- `isEqualLocal`/`getTreeHashLocal` compare column sources by their column
  source id, so they are cheap and distinguish distinct source instances
  (the two sides of a self join), but are meaningful only within one query
  tree and its clones.

`IColumnSourceNode::CompareOptions` no longer needs to express the source
distinction through aliases, so the column source weak pointer machinery on
`IQueryTreeNode` (the generic `weak_pointers` vector and the two argument
constructor) is removed. `ColumnNode` now holds its column source weak
pointer and an id snapshot directly; the snapshot is kept in sync by
`setColumnSource` and, on cloning, by `remapColumnSourceAfterClone`.

Cloning preserves column source ids by default, so a clone is the same
logical source and compares equal to the original (needed because, for
example, a GROUP BY key is cloned from the projection expression). The new
`cloneWithFreshColumnSourceIds` is used where a clone is a genuinely
separate instance that may coexist with the original in one tree, i.e. CTE
expansion in `QueryAnalyzer`.

Local comparison treats a column source that is internal to the compared
subtree (reachable through children) by content, and an external source
(the enclosing query's join tree) by id. This is positional, not node-type
based, so lambdas, interpolate and subqueries used as expressions all match
across independently built copies (alpha-equivalence) while self-join table
expressions stay distinguished. Without it, a GROUP BY key written as a
full expression containing a lambda or subquery would not match the equal
projection expression (each resolved into a distinct instance), most
visibly on a distributed shard that re-parses the GROUP BY expression.

`ValidationUtils::compareGroupByKeys` now uses `isEqualLocal` and the
helper `areColumnSourcesEqual` is removed: comparing the whole expression
by column source id subsumes the previous two step compare that relied on
source aliases.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 15, 2026
Comment thread src/Analyzer/Resolve/IdentifierResolveScope.h Outdated
`preCalculateCacheKey` and the `calculateCacheKey(std::shared_ptr<TableJoin> &,
IQueryTreeNode::HashState)` overload in `ConcurrentHashJoin` have no callers
anywhere. They were the cache key helpers of the logical join step and were left
orphaned after `Revert "Lazily apply selector and replication indexes in join"`.

This also removes the last use of `getTreeHashGlobal` from `ConcurrentHashJoin`.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread src/Analyzer/IQueryTreeNode.cpp Outdated
Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp
KochetovNicolai and others added 6 commits June 16, 2026 16:58
`getCachedAST` keyed the per-node AST cache on `getTreeHash`, which hashes
the constant's value column content on every `toAST` call. The generated
AST depends only on the constant value, and the value column is replaced
(not mutated) on changes such as `convertToNullable`, so the column
pointer identifies it and is enough to invalidate the cache without
hashing the (possibly large) column.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Switch comparisons and hash containers that operate within a single query
tree from the global to the local variant (`isEqualLocal`/`getTreeHashLocal`
and the `...WithLocalHash...` containers). Within one tree these are
equivalent to the global variants but cheaper (no traversal of column
sources), and they are more correct for self joins: e.g. deduplicating
ORDER BY keys or matching GROUP BY keys now distinguishes `a.x` from `b.x`
by column source id instead of collapsing them as structurally equal.

Converted: the identifier resolver expression/USING/array-join comparisons,
and the optimizer passes that compare or deduplicate expressions within one
query (logical expression optimizer, group by key elimination, order by /
limit by deduplication, grouping functions, fuse functions, etc.).

Kept global where comparison is cross-tree or must be instance-agnostic:
the self-join join-tree leniency in identifier resolution and
`CrossToInnerJoinPass`/`LogicalExpressionOptimizerPass` join matching; the
constraint passes (`ComparisonGraph`, `ConstraintsDescription`,
`ConvertQueryToCNFPass`) which compare a constraint tree against the query
tree; and the cross-tree/cross-server sites (prepared set keys, distributed
set names, per-shard subquery dedup, hash-join cache, scalar caches).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A CTE referenced in the join tree (`FROM cte AS a, cte AS b`) resolves to
the same shared CTE node, which `resolveQueryJoinTreeNode` then cloned with
the id-preserving `clone`. As a result the two references shared column
source ids, so `a.x` and `b.x` compared equal under local comparison. With
the optimizer passes using local comparison this dropped conditions: e.g.
in a self join of a CTE, `a.x <> 0 AND b.x <> 0` deduplicated to a single
predicate, and the SEND+MORE=MONEY query (03274) returned 25 rows instead
of one because `m.digit <> 0` was removed.

Clone CTE references in the join tree with `cloneWithFreshColumnSourceIds`,
matching what the expression-context CTE resolution already does. Plain
tables are unaffected: each `FROM` entry resolves to its own table node.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts:
#	src/Analyzer/Resolve/QueryAnalyzer.cpp
#	src/Analyzer/Resolve/resolveFunction.cpp
scope.nullable_group_by_keys used the global hash map, so with
group_by_use_nulls a projection column from one self-join side could match
a structurally equal GROUP BY key from the other side and be replaced with
its Nullable clone in resolveExpressionNode, before compareGroupByKeys
applies local source-id semantics. For example

  SELECT r.number FROM numbers(2) AS l INNER JOIN numbers(2) AS r
  ON l.number = r.number GROUP BY l.number WITH ROLLUP

let r.number match the l.number key, accepting a column that is not grouped.

Make the map (and convertNestedGroupByKeysToNullable / subtreeContainsAnyNode
that consume it) use local source-id comparison, so a key only matches a
reference to the same source instance. Add a group_by_use_nulls self-join
regression test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
areTreesEqual compared column sources by content using a one-directional
record of visited pairs, so a single repeated source could compare equal to
several distinct structurally equal sources. For example
List(col(s), col(s)) and List(col(s1), col(s2)) compared equal, while
computeTreeHash assigns a distinct visitation identifier per source and
hashed them differently. Because QueryTreeNodeWithHash::operator== rejects
unequal hashes before calling isEqual, such keys were missed in the hash
keyed containers.

Enforce a bijective source correspondence: a content-compared source is
compared once on first sight, every later reference must map to the same
counterpart, and a counterpart may be claimed only once (lhs_source_to_rhs
maps each lhs source to its rhs counterpart; a set of already-claimed rhs
sources rejects the reverse collision). This mirrors computeTreeHash's
per-source identifiers and applies to global comparison and to local
comparison of sources internal to the compared subtree. Add repeated-source
unit tests for both.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated
The FROM table-identifier resolution path only minted fresh column source
ids for CTE and materialized-CTE references, falling back to a plain clone
(which preserves ids) otherwise. But the same path also resolves reused
table-expression aliases registered by TableExpressionsAliasVisitor, e.g.

  SELECT r.number
  FROM (SELECT number FROM numbers(2)) AS l JOIN l AS r ON l.number = r.number
  GROUP BY l.number WITH ROLLUP

Here `r` resolves to the same aliased subquery node as `l`, resolved_to_cte
is false, and the preserved id made l.number and r.number share a local
source id, so compareGroupByKeys accepted the non-grouped r.number as if it
were grouped by l.number.

Each join-tree reference is a distinct column source instance regardless of
whether it is a CTE, a reused alias, or a plain table, so always clone with
fresh column source ids. Extend the regression test with the alias-reuse
case.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@KochetovNicolai KochetovNicolai marked this pull request as ready for review June 18, 2026 11:18
@novikd novikd self-assigned this Jun 18, 2026
# Conflicts:
#	src/Analyzer/Resolve/IdentifierResolveScope.cpp
#	src/Storages/StorageDistributed.cpp
@clickhouse-gh

clickhouse-gh Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.20% 85.20% +0.00%
Functions 92.50% 92.50% +0.00%
Branches 77.40% 77.40% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 725/763 (95.02%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 15 line(s) · Uncovered code

Full report · Diff report

@novikd novikd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that code can be simplified, and IColumnSourceNode can be removed completely.

/// If the expression has extra parentheses around it in the original query
bool parenthesized = false;
/// Set by IColumnSourceNode constructor
bool is_column_source = false;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it's better to override in the IColumnSourceNode

ColumnNodePtrWithGlobalHashSet QueryNode::getCorrelatedColumnsSet() const
{
ColumnNodePtrWithHashSet result;
ColumnNodePtrWithGlobalHashSet result;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect local hash would be enough here, but it could be done in a separate PR

/** Column source instances are part of the comparison through column source ids, so t2.x and
* t1.x are different keys: SELECT t2.x FROM t1 JOIN t1 as t2 ON t1.x = t2.x GROUP BY t1.x;
*/
return node->isEqualLocal(*group_by_key_node, {.compare_aliases = false});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change

&& constraint_node->getColumnName() == node->getColumnName();
else
return constraint_node->isEqual(*node);
return constraint_node->isEqualGlobal(*node);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also might be local

collectSets(expression, *planner_context);

ColumnNodePtrWithHashSet empty_correlated_columns_set;
ColumnNodePtrWithGlobalHashSet empty_correlated_columns_set;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be local

/// The generated AST depends only on the constant value. The value column is replaced, not
/// mutated, on changes (e.g. `convertToNullable`), so its pointer identifies the value and is
/// enough to invalidate the cache, without hashing the (possibly large) column content.
hash_state.update(reinterpret_cast<std::uintptr_t>(constant_value.getColumn().get()));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the type?


/// Mutable so cloneAndReplace can preserve the original id on a clone (a freshly minted id is
/// assigned at construction and overwritten there unless fresh ids were explicitly requested).
ColumnSourceId column_source_id;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need it at all? Pointer can be used as an ID.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not exactly the same. By default, IQueryTreeNode::clone keeps ids, so that isEqualLocal works for cloned subtrees on the same tree (I think GROUP BY keys is a good example; they can contain lambdas, which are column sources).

column_source_id = extractColumnSourceId(source, column);
}

void ColumnNode::remapColumnSourceAfterClone(const ReplacementMap & old_pointer_to_new_pointer)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function would not be needed if a pointer was used as an ID.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not... If the source was cloned, we definitely need to update the reference.

Technically, it's possible to remove if we handle specially ColumnNode in clone.

Comment on lines +141 to +159

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it can be improved. We can avoid a deep comparison of source table expressions altogether:

  • In local mode, the source table expression pointer can be used as ID
  • In global mode, the source table expression ID can be calculated on the fly

For the global case, there should be 2 functions <op>Global and <op>GlobalImpl.

<op>Global will create a map from table expression ref to ID and then call <op>GlobalImpl.
We can guarantee correctness because we have a deterministic traversal order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants