Fix duplicate ALIAS columns causing column count mismatch in Distributed by Onyx2406 · Pull Request #101789 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix duplicate ALIAS columns causing column count mismatch in Distributed#101789

Open
Onyx2406 wants to merge 13 commits into
ClickHouse:masterfrom
Onyx2406:fix/distributed-duplicate-alias-columns
Open

Fix duplicate ALIAS columns causing column count mismatch in Distributed#101789
Onyx2406 wants to merge 13 commits into
ClickHouse:masterfrom
Onyx2406:fix/distributed-duplicate-alias-columns

Conversation

@Onyx2406

@Onyx2406 Onyx2406 commented Apr 4, 2026

Copy link
Copy Markdown
Contributor

What this PR does

When several ALIAS columns of one table share the same expression (e.g. d Float64 ALIAS b + c, e Float64 ALIAS b + c), the query rewrite for Distributed tables expanded them into identical expression trees with identical action names, the planner collapsed them into a single column of the block sent over the network, and the positional conversion against the header expected by the initiator failed with NUMBER_OF_COLUMNS_DOESNT_MATCH.

The fix wraps such expansions into the internal function __actionName(expression, 'name'), which exists exactly for pinning planner action names (see PlannerActionsVisitor) and is a no-op at runtime. Every ALIAS column gets its own action name, so any number of duplicates stay distinct, while all occurrences of one ALIAS column resolve to the same name and are correctly recognized as one column. The duplicate detection is a deterministic two-pass traversal scoped per column source. Occurrences in WHERE/PREWHERE are expanded bare, so primary key and skipping index analysis on the shards is unaffected. The first argument of __actionName (returned unchanged) may now have any type; the name argument must still be a constant, non-empty, genuine String.

Closes: #85895

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix NUMBER_OF_COLUMNS_DOESNT_MATCH error when querying a Distributed table with several ALIAS columns sharing the same expression.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Onyx2406 added 3 commits April 4, 2026 18:47
When two ALIAS columns share the same expression, the Planner
deduplicated them causing a column count mismatch. Wrap duplicate
ALIAS expressions in `identity()` to keep them distinct while
preserving semantics.

Fixes ClickHouse#85895
Rename `04073_distributed_duplicate_alias_columns` to `04079` since
number 04073 is already taken on master by other tests.
… avoid number collision with upstream master
Comment thread src/Storages/StorageDistributed.cpp Outdated
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Apr 5, 2026
@clickhouse-gh

clickhouse-gh Bot commented Apr 5, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Apr 5, 2026
Comment thread src/Storages/StorageDistributed.cpp Outdated
…NS_FOR_ALIAS`

The previous revision decided per *visit* whether to wrap an expanded
`ALIAS` column in `identity`: the first time an expression hash was seen
it stayed bare, and every later visit wrapped. That broke ordinary
queries that reference the same `ALIAS` column more than once (e.g.
`SELECT user_level, grouping(user_level) GROUP BY user_level`), because
the SELECT occurrence expanded to `multiIf(...)` while the GROUP BY
occurrence expanded to `identity(multiIf(...))`, and the Planner raised
`MULTIPLE_EXPRESSIONS_FOR_ALIAS` for the single alias now bound to two
different expressions. Regressions were visible on
`03259_grouping_sets_aliases`, `02494_optimize_group_by_function_keys_and_alias_columns`,
`00594_alias_in_distributed`, and `03034_normalized_ast`.

Key the wrap decision on the alias name, not the visit order. The first
visit of a new alias decides `wrap` vs `no-wrap` based on whether the
expression hash was already used by another alias, and every subsequent
visit of the same alias reuses that decision. Different aliases that
share an expression still get `identity` on the later one, which is the
behavior ClickHouse#85895 requires.
The previous revision set the alias on `column_expression` up-front and
then, when it decided to wrap, passed the already-aliased expression into
`identity`. That produced `identity(X AS e) AS e` — the same alias on both
the inner expression and the outer wrapper. The analyzer later saw two
distinct expressions for alias `e` and threw
`MULTIPLE_EXPRESSIONS_FOR_ALIAS` on `04090_distributed_duplicate_alias_columns`
and `00594_alias_in_distributed`.

Factor the expansion tail into `finalizeExpansion`, which either:
  - attaches the alias to the bare expression (when the alias is seen for
    the first time), or
  - strips any lingering alias from the inner expression and attaches the
    alias only to the `identity` wrapper.
The alias is therefore bound to exactly one node, matching what the
Planner and analyzer expect.
Comment thread src/Storages/StorageDistributed.cpp Outdated
Audit of the `identity()` cache fix pointed out that the regression test
only exercised the original bug (two distinct aliases that share one
expression). The class of query the cache specifically guards against —
the *same* ALIAS column referenced from more than one clause — wasn't
covered, so a regression of that logic would slip through.

Add four queries that each reference the same ALIAS column from multiple
positions:

- SELECT + GROUP BY
- SELECT + GROUP BY + HAVING
- SELECT + ORDER BY
- Two aliases `d` and `e`, each referenced more than once in the output,
  so the visitor both hits the cache *and* the identity wrap in the same
  query.

If the visitor ever returns a mismatched expansion between visits of the
same alias, these cases raise MULTIPLE_EXPRESSIONS_FOR_ALIAS and fail.
Comment thread tests/queries/0_stateless/04090_distributed_duplicate_alias_columns.sql Outdated
Comment thread src/Storages/StorageDistributed.cpp Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged current master (branch was ~1537 commits behind); the merge is conflict-free and the three red checks were not caused by this PR:

  • 00096_obfuscator_save_load and 00175_obfuscator_schema_inference — flaky (3/12 and 2/10 reruns), unrelated to a StorageDistributed change.
  • Unit tests (asan_ubsan) — job-level only; the detailed report shows 0 failing tests (the UntrackedMemoryRegistry UBSan issue was reverted on master, now in this branch via the merge).

Root cause (reproduced on master). Plain SELECT d, e FROM dist already works; only the aggregate case fails. After ReplaseAliasColumnsVisitor expands sum(d)/sum(e) into sum(b + c)/sum(b + c), the planner deduplicates the identical aggregate action node into a single column, so the shard produces 1 column while the Distributed header expects 2 → NUMBER_OF_COLUMNS_DOESNT_MATCH.

Design note. The current fix disambiguates by wrapping later occurrences in identity. As flagged in review, this wrapping is applied to every occurrence of the duplicate alias, including WHERE/PREWHERE. Since identity disables index analysis, a table like d ALIAS key, e ALIAS key with SELECT d, e FROM dist WHERE e = 42 would lose primary-key/skip-index pruning on the shards. So the column-count bug is fixed, but at the cost of pruning in predicate contexts.

I've held off on further changes to the identity mechanism pending a decision on direction (refine to keep predicate expansions bare and disambiguate only projected/aggregate action names, vs. a root-cause fix such as not expanding a duplicate ALIAS for shards that already expose the column). Happy to implement whichever you prefer.

The internal function `__actionName(expression, 'name')` overrides the action
name of `expression` with the constant 'name' (see `PlannerActionsVisitor`).
The 2024 hardening against direct calls restricted both arguments to `String`,
which was fine while the function was generated only for constants. To reuse it
for keeping duplicate `ALIAS` column expansions distinct in queries sent to
shards of Distributed tables, the first argument (the renamed expression, which
is returned unchanged) may now have any type. The second argument (the name)
must still be a genuine `String`, constant and non-empty, so the protection
against malformed direct calls is preserved.

Related: ClickHouse#85895
Related: ClickHouse#101789
Replace the `identity` wrapping of duplicate `ALIAS` column expansions in
`ReplaceAliasColumnsVisitor` (also fixing the `Replase` typo in its name) with
the internal function `__actionName(expression, 'name')`, which exists exactly
for pinning planner action names and is a no-op at runtime:

- Any number of `ALIAS` columns sharing one expression stay distinct, because
  every column gets its own name. The `identity` approach kept only two:
  the third and subsequent duplicates produced identical `identity(...)` trees
  that were collapsed again, and the query failed with the same
  `NUMBER_OF_COLUMNS_DOESNT_MATCH` exception.
- The wrap decision is computed by a deterministic two-pass traversal scoped
  per column source instead of a visit-order-dependent cache keyed by alias
  name alone, so equal alias names of different tables cannot be conflated.
- Occurrences in WHERE and PREWHERE are expanded bare: filters are computed
  entirely on the shards, and an opaque wrapper would prevent primary key and
  skipping index analysis there. `identity` disables index analysis by design,
  so the previous approach pessimized filtering by a duplicate `ALIAS` column
  to a full scan.

The test now also covers three duplicates, GROUP BY keys, subqueries, GLOBAL
JOIN of two distributed tables, `ALIAS` columns missing on the shard table,
equal alias names with different expressions in two sources, and verifies that
primary key analysis still prunes granules when filtering by a duplicate
`ALIAS` column.

Closes: ClickHouse#85895
@alexey-milovidov

Copy link
Copy Markdown
Member

I reworked the fix to address the review feedback (commits 3f5995f1c5f, dfd7798ef2e):

  • Instead of identity, duplicate ALIAS column expansions are wrapped into the internal function __actionName(expression, 'name'), which exists exactly for pinning planner action names (see PlannerActionsVisitor) and is a no-op at runtime. Its first argument may now have any type (previously restricted to String); the name argument must still be a constant, non-empty, genuine String.
  • Every ALIAS column gets its own action name, so any number of duplicates stay distinct. The identity version handled only two: with three duplicates (d, e, f all ALIAS b + c) the second and third wraps produced identical identity(b + c) trees, which were collapsed again, and the query kept failing with the same NUMBER_OF_COLUMNS_DOESNT_MATCH exception (verified).
  • The wrap decision is a deterministic two-pass traversal scoped per column source instead of a visit-order-dependent cache keyed by alias name, so equal alias names of different tables cannot be conflated, and the MULTIPLE_EXPRESSIONS_FOR_ALIAS workarounds are unnecessary by construction.
  • WHERE/PREWHERE occurrences are expanded bare, so primary key and skipping index analysis on the shards is unaffected (identity disables index analysis by design). Verified: filtering 2M rows by a duplicate ALIAS of the sorting key expression reads 16384 rows, not 2M.

The test now covers three duplicates, GROUP BY keys, subqueries, GLOBAL JOIN, ALIAS columns missing on the shard table, equal alias names with different expressions in two sources, and asserts granule pruning via read_rows. All of 00594_alias_in_distributed, 01576_alias_column_rewrite, 01923_different_expression_name_alias, 02008_aliased_column_distributed_bug, 02517_wrong_total_structure_crash, 03035_alias_column_bug_distributed, 03402_join_using_alias, 03035_internal_functions_direct_call, and 04201_action_name_lowcardinality_args pass locally against the patched build.

Note: a GLOBAL JOIN of two different distributed tables that both define the same alias name fails with MULTIPLE_EXPRESSIONS_FOR_ALIAS — that is a pre-existing problem on master (the expansion binds the same alias to two different trees) and is independent of this fix.

@alexey-milovidov

Copy link
Copy Markdown
Member

@groeneai, investigate the failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101789&sha=9761f4206efb6c6de7509e18e9beef83c44444f2&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%204%2F6%29 (test_profile_max_sessions_for_user/test.py::test_profile_max_sessions_for_user_postgres timed out after 900 s under MSan) and provide a fix in a separate PR. If the fix is already in progress, link it here. It is unrelated to this PR, which only changes the ALIAS column rewrite for Distributed tables.


column_expression->removeAlias();

auto action_name = fmt::format("__aliasColumn_{}_{}", source_ordinals.at(column_source), column_name);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated action name can collide with a real input column name. For a valid table with a physical column named __aliasColumn_0_d plus duplicate aliases d ALIAS b + c, e ALIAS b + c, a query like SELECT __aliasColumn_0_d, d, e FROM dist puts the physical column and the wrapper for d in the same ActionsDAG namespace. PlannerActionsVisitor / ActionsScopeNode::addInputColumnIfNecessary and addFunctionIfNecessary both reuse an existing node by result name, so whichever expression is visited first wins and the other column can silently return the wrong values. Please generate these names through a namespace that cannot collide with user columns, for example by using the planner's column identifiers or unique-id machinery and checking GlobalPlannerContext::hasColumnIdentifier, and add a regression with a physical __aliasColumn_0_d column.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov Confirmed: this is the known test_profile_max_sessions_for_user_postgres flaky timeout under MSan, unrelated to your ALIAS column rewrite. Already root-caused, and a separate fix is in progress.

Fix PR (in progress): #107086 (test-only, CI clean, awaiting review).

Root cause. threaded_run_test starts 3 postgres sessions (limit 2), waits for the "overflown session count" log (which fires at session acquisition, before query execution), then issues a single KILL QUERY WHERE user='test_user' SYNC plus a bare thread.join(). If an accepted session's infinite SELECT ... FROM system.numbers has not yet registered in system.processes when that one-shot KILL snapshots the process list, it is never killed and runs forever. psycopg2 has no client-side query timeout, so the worker thread blocks in fetchall() until the pytest 900s test-level limit. The native variants (tcp/http/mysql/grpc) are masked by the 600s query timeout in helpers/client.py; postgres has none, and MSan widens the connect-to-execute window, which is why it is postgres+MSan specific.

CIDB (30d). 14 _postgres timeouts across 13 distinct PRs plus master, 11/14 on amd_msan; 1 each for tcp/http/mysql/grpc. No regression (test.py unchanged since 2024).

Fix in #107086. Replace the one-shot KILL + bare join with a bounded loop that re-issues KILL QUERY ... SYNC while any session thread is still alive. The session-overflow assertion is unchanged.

alexey-milovidov and others added 2 commits June 12, 2026 20:09
The fix for issue ClickHouse#85895 lives entirely in the new analyzer's distributed
query-tree rewrite (`buildQueryTreeDistributed`, wrapping duplicate `ALIAS`
expansions into the internal `__actionName` function). The old analyzer uses a
separate AST-based path that cannot resolve `ALIAS` columns of a `GLOBAL JOIN`
right-hand table (`UNKNOWN_IDENTIFIER` in `ExpressionAnalyzer` /
`GlobalSubqueriesVisitor`), which is an unrelated, pre-existing limitation. Tag
the test `no-old-analyzer` so it runs only where the fix applies.

Failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101789&sha=dfd7798ef2e74556fed1933e5f312fd160322ac2&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20old%20analyzer%2C%20s3%20storage%2C%20DatabaseReplicated%2C%20WasmEdge%2C%20parallel%29

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged master and fixed the one real CI failure.

  • 04090_distributed_duplicate_alias_columns (old analyzer config): the fix lives entirely in the new analyzer's distributed query-tree rewrite (buildQueryTreeDistributed, wrapping duplicate ALIAS expansions into __actionName). On the old analyzer the GLOBAL JOIN case fails with UNKNOWN_IDENTIFIER: Unknown column: e deep in ExpressionAnalyzer/GlobalSubqueriesVisitor — a separate, pre-existing old-analyzer limitation unrelated to this PR. Tagged the test no-old-analyzer so it runs only where the fix applies (commit 448cd3503ac).
  • BuzzHouse (amd_debug)Inconsistent AST formatting (STID: 1941-1bfa): unrelated known flake (emoji literals + tupleElement/arrayElement), tracked in Logical error: 'Inconsistent AST formatting: the original AST: (STID: 1941-1bfa) #106850.

One unresolved review thread remains (the generated __actionName action name __aliasColumn_<n>_<name> could in principle collide with a physical column of the same name) — left for @alexey-milovidov as a design call on the naming scheme.

void replace(QueryTreeNodePtr & node, bool is_filter_context)
{
if (auto column_expression = getColumnNodeAliasExpression(node, is_filter_context))
node = std::move(column_expression);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only expands the alias column once. If an alias points at another alias, for example d ALIAS b + c, e ALIAS d, getColumnNodeAliasExpression for e returns the ColumnNode for d; after this assignment the visitor descends into d's children, but it never replaces the new root ColumnNode d with b + c. The shard query can still contain d AS e, which breaks the stated contract when aliases exist only on the Distributed table and are missing on the shard table. It also means duplicate detection misses aliases that become equal only after nested ALIAS expansion.

Please expand recursively until the replacement root is no longer an ALIAS ColumnNode (or normalize recursively before computing duplicate groups), and add a regression with nested aliases, ideally with the shard table lacking both alias columns.

Comment thread src/Functions/identity.h
static FunctionPtr create(ContextPtr) { return std::make_shared<FunctionActionName>(); }
size_t getNumberOfArguments() const override { return 2; }
ColumnNumbers getArgumentsThatAreAlwaysConstant() const override { return {0, 1}; }
ColumnNumbers getArgumentsThatAreAlwaysConstant() const override { return {1}; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relaxing the first argument from constant to arbitrary expressions makes direct __actionName calls able to forge any ActionsDAG node name. PlannerActionsVisitor trusts the second argument as the result name, while ActionsScopeNode::addFunctionIfNecessary reuses an existing node with that name, so a direct query such as SELECT __actionName(number + 1, 'number') FROM numbers(3) can resolve to the existing input node number instead of computing number + 1. This was previously rejected because the first argument had to be constant, and the tests now explicitly allow non-constant first arguments.

Please keep the direct-call hardening: either make the arbitrary-expression form generated-only, or reject/namespace __actionName names that collide with existing action nodes so malformed direct calls fail instead of returning wrong results.

@alexey-milovidov

Copy link
Copy Markdown
Member

Merged master again. The only red check, Build (arm_tidy), was a stale base: the branch still carried the pre-fix applyHistogramQuantile.cpp with an uninitialized out_of_range_value (cppcoreguidelines-init-variables), which has since been fixed on master. After the merge the variable is initialized and the tidy build is unblocked. The merge is conflict-free and the PR diff is unchanged (8 files).

@clickhouse-gh

clickhouse-gh Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.80% 84.80% +0.00%
Functions 92.40% 92.40% +0.00%
Branches 77.30% 77.40% +0.10%

Changed lines: Changed C/C++ lines covered by tests: 73/73 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Duplicate alias does not work with distributed table

3 participants