Add a test for #48881, but it shows a bug with Parallel Replicas by alexey-milovidov · Pull Request #80310 · ClickHouse/ClickHouse · GitHub
Skip to content

Add a test for #48881, but it shows a bug with Parallel Replicas#80310

Merged
alexey-milovidov merged 59 commits into
masterfrom
add-test-48881
Jun 9, 2026
Merged

Add a test for #48881, but it shows a bug with Parallel Replicas#80310
alexey-milovidov merged 59 commits into
masterfrom
add-test-48881

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 16, 2025

Copy link
Copy Markdown
Member

Closes: #48881
Related: #74324
Related: #103806

This started as a regression test for #48881 (wrong primary key analysis with an offset on the order field). Adding the test surfaced a MULTIPLE_EXPRESSIONS_FOR_ALIAS exception under parallel replicas for queries with self-shadowing projection aliases such as SELECT *, day + 365 AS day.

#103806 deduplicated conflicting projection aliases only on the top-level SELECT of the dispatched query, so the conflict was still reachable when it lived in a nested subquery (exactly the shape of the #48881 repro: SELECT count() FROM (... SELECT *, day + 365 AS day ...)). This PR refactors the deduplication (deduplicateProjectionAliases) to walk every SELECT at any nesting level of the dispatched AST, so the test now runs with parallel replicas enabled.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a MULTIPLE_EXPRESSIONS_FOR_ALIAS exception for queries with duplicate projection aliases (for example SELECT *, day + 365 AS day) inside nested subqueries when running with parallel replicas.

Version info

  • Merged into: 26.6.1.526

@clickhouse-gh

clickhouse-gh Bot commented May 16, 2025

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label May 16, 2025
@alexey-milovidov alexey-milovidov changed the title Add a test for #48881 Add a test for #48881, but it shows a bug with Parallel Replicas May 19, 2025
@alexey-milovidov alexey-milovidov marked this pull request as draft May 19, 2025 06:22
@devcrafter devcrafter self-assigned this May 21, 2025
@devcrafter

Copy link
Copy Markdown
Member

Issue related to failed test in CI run with PR - #80440

@clickhouse-gh

clickhouse-gh Bot commented Jul 14, 2025

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [bfa8ad2]

Summary:


AI Review

Summary

The PR adds a regression test for #48881 and now also fixes recursive duplicate projection alias stripping in queryNodeToDistributedSelectQuery, so nested subqueries sent to remote replicas no longer trip MULTIPLE_EXPRESSIONS_FOR_ALIAS. I did not find a code-level blocker in the current diff, but the PR metadata is stale now that the PR contains a user-visible bug fix.

PR Metadata
  • The current Changelog category is Not for changelog, but the PR now changes user-visible behavior by fixing a MULTIPLE_EXPRESSIONS_FOR_ALIAS exception under parallel replicas.
  • Suggested replacement:
### Changelog category (leave one):
- Bug Fix

### Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix `MULTIPLE_EXPRESSIONS_FOR_ALIAS` for parallel-replica queries with duplicate projection aliases inside nested subqueries.
Findings

💡 Nits

  • [src/Planner/Utils.cpp:280] The recursive alias-deduplication call makes this PR a real bug fix rather than a test-only change, so the release-note metadata should be updated as shown above. Posted an inline comment on this line.
Final Verdict

✅ No code-level blockers found. Please fix the PR metadata before merge.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@devcrafter, a problem with parallel replicas.

@clickhouse-gh

clickhouse-gh Bot commented Sep 16, 2025

Copy link
Copy Markdown
Contributor

Dear @alexey-milovidov, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

@clickhouse-gh

clickhouse-gh Bot commented Dec 23, 2025

Copy link
Copy Markdown
Contributor

Dear @alexey-milovidov, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

alexey-milovidov and others added 7 commits February 6, 2026 04:25
The test uses `SELECT *, day + 365 AS day` which triggers
`MULTIPLE_EXPRESSIONS_FOR_ALIAS` under Parallel Replicas due to a
known analyzer bug (#80440). Tag with `no-parallel-replicas` to avoid
the unrelated failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread tests/queries/0_stateless/03517_wrong_primary_key_analysis.sql Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member Author

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

@alexey-milovidov alexey-milovidov marked this pull request as ready for review April 25, 2026 12:38
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, the Stress test (amd_asan_ubsan) failure on b466ac19 is the Hung check on SYSTEM SYNC MERGES t_manual; from 04207_schedule_merge_basic.sql. That test was added in master commit aa5966d99cd and is unrelated to this PR's changes (which only touch tests/queries/0_stateless/03517_wrong_primary_key_analysis.{sql,reference}).

Log: https://s3.amazonaws.com/clickhouse-test-reports/PRs/80310/b466ac191f60efc6ced748be0f2ec8fedc19b661/stress_test_amd_asan_ubsan/hung_check.log

Could you investigate the failure and provide a fix in a separate PR? If a fix is already in progress, please link it here.

alexey-milovidov and others added 5 commits May 23, 2026 04:40
…y_analysis

Issue #74324 (`MULTIPLE_EXPRESSIONS_FOR_ALIAS` under parallel replicas) was
closed by the fix #103806, so the old comment ("remove this guard once #74324
is fixed") is misleading: it invites removing the guard now that #74324 is
closed.

However, #103806 covered other shapes of that bug, not the self-shadowing
alias used by this test (`SELECT *, day + 365 AS day`, where `day` is also a
column). Verified locally against master with #103806 present: under forced
parallel replicas the remote replica's analyzer still sees two bodies for the
alias `day` and throws `MULTIPLE_EXPRESSIONS_FOR_ALIAS`. The `SET enable_parallel_replicas = 0`
guard is therefore still required for this test to stay green in the
`ParallelReplicas` CI suite (the in-query `SET` overrides the suite-level
forcing). Auto-PR statistics mode (`automatic_parallel_replicas_mode = 2`)
already passes, since it does not actually execute with parallel replicas.

Update the comment to state this accurately and tie removal of the guard to the
analyzer handling this shape, not to #74324 being closed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Updated the branch onto current `master` (was 574 commits behind), which now includes the merged #103806.

I verified the parallel-replicas situation empirically on a fresh build at `3c6ba95806e` (includes #103806), running the test queries against `test_cluster_one_shard_three_replicas_localhost` with `enable_parallel_replicas = 1, max_parallel_replicas = 3`:

  • The self-shadowing alias shape in this test — `SELECT *, day + 365 AS day` over `SELECT *` — still reproduces `MULTIPLE_EXPRESSIONS_FOR_ALIAS` from the remote replica even after Fix MULTIPLE_EXPRESSIONS_FOR_ALIAS with enabled parallel replicas #103806:

    Code: 179. Multiple expressions __table3.day + 365 AS day and __table3.day AS day for alias day ... (MULTIPLE_EXPRESSIONS_FOR_ALIAS)

  • With the `SET enable_parallel_replicas = 0` guard, the full test passes deterministically (4× `107`, matching the reference).

So #103806 covered the `PREWHERE`-alias and `SELECT *`-over-JOIN shapes, but not this self-shadowing alias over `SELECT *`. The explicit guard is therefore still required, and the comment in the test is accurate. Note that #74324 is currently closed as completed by #103806, yet this shape still reproduces — it may warrant reopening or a follow-up issue.

The only CI failure is the chronic `Assertion 'px != 0' failed` terminate in `QueryPlanResourceHolder.cpp` (STID 3685-4830), triggered by an unrelated fuzzer query (`SELECT DISTINCT 1025, toFixedString(...)`) and tracked under #106079 — unrelated to this test-only change.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Updated the branch onto current master (was 576 commits behind), clean merge with no conflicts. The diff remains test-only (03517_wrong_primary_key_analysis.{sql,reference}).

Re-verified the parallel-replicas situation on a fresh master build (rev 0ea13fba9e7, post-#103806), running the test queries against a three-replica cluster with enable_parallel_replicas = 1, max_parallel_replicas = 3:

  • The self-shadowing alias shape — SELECT *, day + 365 AS day over SELECT *still reproduces MULTIPLE_EXPRESSIONS_FOR_ALIAS from the remote replica:

    Code: 179. Multiple expressions __table3.day + 365 AS day and __table3.day AS day for alias day ... (MULTIPLE_EXPRESSIONS_FOR_ALIAS)

  • With the SET enable_parallel_replicas = 0 guard, the full test passes deterministically (4× 107, matching the reference) both via clickhouse-local and against a real server.

So the explicit guard is still required and the comment in the test remains accurate. #74324 is still effectively open for this self-shadowing shape — #103806 covered the PREWHERE-alias and SELECT *-over-JOIN cases but not this one.

Comment thread tests/queries/0_stateless/03517_wrong_primary_key_analysis.sql Outdated
…l replicas

The self-shadowing alias shape `SELECT *, day + 365 AS day` serializes to
`__table3.day AS day, __table3.day + 365 AS day` — the same alias `day` bound
to two different bodies. When the analyzed query tree is converted back to AST
and dispatched to a remote replica for parallel replicas, the receiver's
analyzer re-resolves it and throws `MULTIPLE_EXPRESSIONS_FOR_ALIAS` (issue
#74324).

`queryNodeToDistributedSelectQuery` already deduplicated such aliases — stripping
the alias on the later, different-body occurrence — but only on the top-level
`SELECT`. In `SELECT count() FROM (... SELECT *, day + 365 AS day ...)` the
conflict lives in a nested subquery, which the dedup never reached, so the query
still failed. The reproduction from #48881 hits exactly this shape.

Apply the dedup recursively to every `SELECT` in the dispatched AST. Each
subquery is its own alias scope and the whole AST is re-analyzed on the receiver,
so the conflict can arise at any nesting level. The same-alias-same-body
duplicates are still preserved (only different bodies are stripped) to avoid
`THERE_IS_NO_COLUMN` on the receiver, and an outer reference to the alias binds
to the first occurrence regardless, so stripping the later alias is safe.

This unblocks `03517_wrong_primary_key_analysis` (test for #48881), which now
runs deterministically with parallel replicas instead of needing an explicit
`enable_parallel_replicas = 0` guard.

Refs #74324
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Fixed the parallel-replicas bug so the test runs with parallel replicas (the explicit enable_parallel_replicas = 0 guard is removed).

Root cause. The self-shadowing alias shape SELECT *, day + 365 AS day produces a projection __table3.day AS day, __table3.day + 365 AS day — the same alias day bound to two different bodies. When the analyzed query tree is converted back to AST and dispatched to a remote replica, the receiver re-analyzes it and throws MULTIPLE_EXPRESSIONS_FOR_ALIAS (#74324). queryNodeToDistributedSelectQuery already deduplicated such aliases — stripping the later, different-body occurrence — but only on the top-level SELECT. In SELECT count() FROM (... SELECT *, day + 365 AS day ...) the conflict lives in a nested subquery, which the dedup never reached, so #103806 left this shape failing.

Fix (f8312827231): apply the dedup recursively to every SELECT in the dispatched AST. Each subquery is its own alias scope and the whole AST is re-analyzed on the receiver, so the conflict can arise at any nesting level. Same-alias-same-body duplicates are still preserved (only different bodies are stripped) to avoid THERE_IS_NO_COLUMN, and an outer reference to the alias binds to the first occurrence regardless, so dropping the later alias is safe.

Verified locally against a three-replica cluster (default_parallel_replicas) with enable_parallel_replicas = 1, max_parallel_replicas = 10, parallel_replicas_for_non_replicated_merge_tree = 1:

This effectively completes #74324 for the self-shadowing-alias-over-SELECT * shape that #103806 did not cover.

Also updated the branch onto current master (was 464 commits behind).

@groeneai, the Stateless tests (amd_debug, parallel) failures (Server died and Logical error: Virtual row boundary violated in MergingSortedAlgorithm ... (STID: 2651-3359)) are unrelated to this test-only/serialization change: the checkVirtualRowBoundary debug assertion was added in master commit 77cf854a3e2 and is currently firing across master (it appears on pull_request_number = 0 and ~30 other PRs in CIDB over the last two days). Could you investigate and provide a fix in a separate PR? If one is already in progress, please link it here.

Comment thread src/Planner/Utils.cpp
@groeneai

groeneai commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

@alexey-milovidov yes, fix is in flight: #106664.

STID 2651-3359 (Virtual row boundary violated in MergingSortedAlgorithm) is chronic across master and ~30 unrelated PRs since the checkVirtualRowBoundary debug assertion was added in commit 77cf854, matching your analysis. The Server died line in the same Stateless tests (amd_debug, parallel) is the downstream cascade of the same LOGICAL_ERROR debug abort (verified across PR #105386, PR #93462, and master).

Root cause: when apply_virtual_row_conversions = false (Union of read-in-order subtrees with per-source conversions, e.g. Distributed two-shards with PK alias columns), setVirtualRow skips the per-source rename, the pk_block columns no longer match the outer merge header, and the recorded boundary in rememberVirtualRowBoundary is a default Decimal64_'0.000000000' that the assertion trips on the first real row. Release builds were already correct because mergeImpl/mergeBatchImpl discard the virtual row via skip_last_row=true; only debug builds aborted.

Fix in #106664: gate rememberVirtualRowBoundary and checkVirtualRowBoundary on apply_virtual_row_conversions = true in MergingSortedAlgorithm.cpp. Regression test 04323_virtual_row_two_shards_pk_alias.sql aborts on master and passes with the fix.

Session: cron:clickhouse-ci-task-worker:20260607-080600

@clickhouse-gh

clickhouse-gh Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.60% 84.50% -0.10%
Functions 92.30% 92.30% +0.00%
Branches 77.20% 77.20% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 23/25 (92.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov marked this pull request as ready for review June 9, 2026 01:14
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 9, 2026
Merged via the queue into master with commit b51d48f Jun 9, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the add-test-48881 branch June 9, 2026 01:30
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Query with offset on order field produces incorrect results

4 participants