Stabilize 04229_trivial_group_by_limit_profile_events under parallel replicas #104996
…replicas

`OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator and lowers `max_rows_to_group_by` on the query's local context. That mutation is not propagated to remote replicas, so once CI's `automatic_parallel_replicas_mode=2` randomization is sampled, the aggregator on the remotes never applies the cap and `OverflowAny` stays at zero. The test then sees `04229_on 0` instead of `04229_on 1` and fails 3/3 reruns (every PR run hits the same diff).

Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS` clauses so the optimization is exercised on the local-coordinator path it is designed for. Query-level `SETTINGS` takes priority over the runner's session-level injection, so the pin is robust against randomization. No `no-*` tag is added (per `.claude/CLAUDE.md` guidance) and no other randomization is disabled: `max_threads = 1` was already pinned, the rest stays randomized.

Locally reproduced with a 3-replica `default_parallel_replicas` cluster on loopback: 5/5 fails without the pin (`04229_on 0`), 20/20 passes with the pin under the same parallel-replicas session settings.

Cross-PR scope (CIDB, 30d): seen on PRs ClickHouse#104473, ClickHouse#100146, ClickHouse#101033, ClickHouse#104694, ClickHouse#104966, all unrelated to the test or its source PR. Reported by @alexey-milovidov on ClickHouse#100146.
cc @alexey-milovidov, could you review this? Test-only fix: pins

Pre-PR validation self-check
Note on follow-up

This PR only stabilizes the test. The underlying behavior, namely that the analyzer pass mutates settings the remote replicas never see, is a separate question. If you'd like the pass to also skip itself when

Session: cron:clickhouse-ci-task-worker:20260515-044500

Reported by @alexey-milovidov on #100146:
Root cause
`OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator and lowers `max_rows_to_group_by` on the query's local context (`mutable_context->setSetting("max_rows_to_group_by", max_rows)`). That mutation is not propagated to remote replicas, so once CI's `automatic_parallel_replicas_mode = 2` randomization is sampled (which sets `enable_parallel_replicas = 1` + `parallel_replicas_for_non_replicated_merge_tree = 1` + `cluster_for_parallel_replicas = 'parallel_replicas'`), the aggregator on the remotes never applies the cap and `OverflowAny` stays at zero. The test then sees `04229_on 0` instead of the expected `04229_on 1` and fails 3/3 reruns.

CIDB cross-PR scope (30 days): the failure has been observed on at least PRs #104473 (the source PR), #100146, #101033, #104694, #104966, all unrelated to the test or to each other.
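The mechanism above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not ClickHouse code: `effective_settings`, `aggregate`, and `run_query` are hypothetical helpers, the "pass" is reduced to a single dictionary write on the initiator's copy of the settings, and `OverflowAny` is simplified to a 0/1 flag that records whether the group cap was hit. It also shows why a query-level `enable_parallel_replicas = 0` restores the expected value when the session has it set to 1.

```python
def effective_settings(session, query):
    """Query-level SETTINGS override session-level values (as the PR states)."""
    return {**session, **query}


def aggregate(keys, max_rows_to_group_by):
    """Toy aggregator: return 1 if the group cap was hit, else 0 (0 = no cap)."""
    groups = set()
    overflow_any = 0
    for key in keys:
        if key in groups:
            continue
        if max_rows_to_group_by and len(groups) >= max_rows_to_group_by:
            overflow_any = 1  # a new key arrived after the cap was reached
            continue
        groups.add(key)
    return overflow_any


def run_query(keys, limit, session, query, replicas=3):
    """Hypothetical model of one GROUP BY ... LIMIT query."""
    settings = effective_settings(session, query)
    # Assumption: remote replicas see the settings without the pass's change.
    remote_settings = dict(settings)
    # Stand-in for the analyzer pass: lowers the cap on the local context only.
    local_settings = dict(settings)
    local_settings["max_rows_to_group_by"] = limit

    if settings.get("enable_parallel_replicas", 0):
        chunks = [keys[i::replicas] for i in range(replicas)]
        cap = remote_settings.get("max_rows_to_group_by", 0)
        return max(aggregate(chunk, cap) for chunk in chunks)
    return aggregate(keys, local_settings["max_rows_to_group_by"])


keys = list(range(100))
randomized_session = {"enable_parallel_replicas": 1}
print(run_query(keys, 10, randomized_session, {}))                               # 0
print(run_query(keys, 10, randomized_session, {"enable_parallel_replicas": 0}))  # 1
```

In this model the first call mirrors the failing `04229_on 0` case and the second mirrors the pinned query; the real aggregator and profile-event accounting are more involved.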
Fix
Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS` clauses so the optimization is exercised on the local-coordinator path it is designed for. Query-level `SETTINGS` takes priority over the runner's session-level injection, so the pin is robust against any further randomization. No `no-*` tag is added (per `.claude/CLAUDE.md` guidance) and no other randomization is disabled: `max_threads = 1` was already pinned, everything else stays randomized.

The test's intent is preserved: it still checks that `OverflowAny` is incremented when `OptimizeTrivialGroupByLimitPass` fires and stays at zero when it doesn't.

Reproduction & verification
Locally with a 3-replica `default_parallel_replicas` cluster on loopback (`localhost`, `127.0.0.2`, `127.0.0.3` on port 9700), simulating the runner's parallel-replicas injection: without the pin, 5/5 runs fail (`04229_on 0`); with the pin, 20/20 runs pass under the same parallel-replicas session settings.

If you want to test the bug interactively, here is a one-line repro against any server with a multi-replica cluster:
Then `SELECT ProfileEvents['OverflowAny'] FROM system.query_log WHERE type = 'QueryFinish'` returns `0` instead of the expected non-zero value.

Changelog category (leave one):
Version info
26.5.1.734