Stabilize 04229_trivial_group_by_limit_profile_events under parallel replicas by groeneai · Pull Request #104996 · ClickHouse/ClickHouse · GitHub
Skip to content

Stabilize 04229_trivial_group_by_limit_profile_events under parallel replicas#104996

Merged
alexey-milovidov merged 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-04229-parallel-replicas
May 17, 2026
Merged

Stabilize 04229_trivial_group_by_limit_profile_events under parallel replicas#104996
alexey-milovidov merged 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-04229-parallel-replicas

Conversation

@groeneai

@groeneai groeneai commented May 15, 2026

Copy link
Copy Markdown
Contributor

Reported by @alexey-milovidov on #100146:

Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel) — only failure is 04229_trivial_group_by_limit_profile_events. This test was added in #104473 and has been failing in this exact profile across many unrelated PRs (e.g. #100146, #101033, #104694, #104966). The optimization apparently doesn't fire under the ParallelReplicas profile and the test does not account for that.

Root cause

OptimizeTrivialGroupByLimitPass runs in the analyzer on the initiator and lowers max_rows_to_group_by on the query's local context (mutable_context->setSetting("max_rows_to_group_by", max_rows)). That mutation is not propagated to remote replicas, so once CI's automatic_parallel_replicas_mode = 2 randomization is sampled (which sets enable_parallel_replicas = 1 + parallel_replicas_for_non_replicated_merge_tree = 1 + cluster_for_parallel_replicas = 'parallel_replicas'), the aggregator on the remotes never applies the cap and OverflowAny stays at zero. The test then sees 04229_on 0 instead of the expected 04229_on 1 and fails 3/3 reruns.

CIDB cross-PR scope (30 days): the failure has been observed on at least PRs #104473 (the source PR), #100146, #101033, #104694, #104966 — all unrelated to the test or to each other.

Fix

Pin enable_parallel_replicas = 0 at the query level in both SETTINGS clauses so the optimization is exercised on the local-coordinator path it is designed for. Query-level SETTINGS takes priority over the runner's session-level injection, so the pin is robust against any further randomization. No no-* tag is added (per .claude/CLAUDE.md guidance) and no other randomization is disabled — max_threads = 1 was already pinned, everything else stays randomized.

The test's intent is preserved: it still checks that OverflowAny is incremented when OptimizeTrivialGroupByLimitPass fires and stays at zero when it doesn't.

Reproduction & verification

Locally with a 3-replica default_parallel_replicas cluster on loopback (localhost, 127.0.0.2, 127.0.0.3 on port 9700), simulating the runner's parallel-replicas injection:

Test Without fix With fix
5 iterations under parallel-replicas randomization 0/5 pass — every run shows 04229_on 0 20/20 pass

If you want to test the bug interactively, here is a one-line repro against any server with a multi-replica cluster:

SELECT k FROM t GROUP BY k LIMIT 5 FORMAT Null
SETTINGS optimize_trivial_group_by_limit_query = 1, max_threads = 1,
    enable_parallel_replicas = 1, parallel_replicas_for_non_replicated_merge_tree = 1,
    cluster_for_parallel_replicas = '<your_cluster>', parallel_replicas_local_plan = 0;

Then SELECT ProfileEvents['OverflowAny'] FROM system.query_log WHERE type = 'QueryFinish' returns 0 instead of the expected non-zero value.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Version info

  • Merged into: 26.5.1.734

…replicas

`OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator
and lowers `max_rows_to_group_by` on the query's local context. That
mutation is not propagated to remote replicas, so once CI's
`automatic_parallel_replicas_mode=2` randomization is sampled the
aggregator on the remotes never applies the cap and `OverflowAny` stays
at zero. The test then sees `04229_on   0` instead of
`04229_on   1` and fails 3/3 reruns (every PR run hits the same diff).

Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS`
clauses so the optimization is exercised on the local-coordinator path
it is designed for. Query-level `SETTINGS` takes priority over the
runner's session-level injection, so the pin is robust against
randomization. No `no-*` tag is added (per
`.claude/CLAUDE.md` guidance) and no other randomization is disabled —
`max_threads = 1` was already pinned, the rest stays randomized.

Locally reproduced with a 3-replica `default_parallel_replicas` cluster
on loopback: 5/5 fails without the pin (`04229_on   0`), 20/20 passes
with the pin under the same parallel-replicas session settings.

Cross-PR scope (CIDB, 30d): seen on PRs ClickHouse#104473, ClickHouse#100146, ClickHouse#101033,
ClickHouse#104694, ClickHouse#104966 — all unrelated to the test or its source PR.

Reported by @alexey-milovidov on
ClickHouse#100146.
@groeneai

Copy link
Copy Markdown
Contributor Author

cc @alexey-milovidov — could you review this? Test-only fix: pins enable_parallel_replicas = 0 at the query level so OptimizeTrivialGroupByLimitPass reliably fires on the local-coordinator path the test was written for, instead of being silently neutralized by the runner's automatic_parallel_replicas_mode randomization.

Pre-PR validation self-check

  • (a) Deterministic repro? Yes — locally reproduced 5/5 fails with enable_parallel_replicas=1 + parallel_replicas_for_non_replicated_merge_tree=1 + cluster_for_parallel_replicas='default_parallel_replicas' + parallel_replicas_local_plan=0 against a 3-replica loopback cluster. Every run shows 04229_on 0 instead of the expected 04229_on 1.
  • (b) Root cause explained? Yes — OptimizeTrivialGroupByLimitPass calls query->getMutableContext()->setSetting("max_rows_to_group_by", max_rows). That mutates the initiator's query context. The mutation is not propagated to remote replicas, so the aggregators on the remotes never apply the cap and ProfileEvents::OverflowAny is never incremented.
  • (c) Fix matches root cause? Yes — the test is for the local-coordinator behavior of the pass; pinning enable_parallel_replicas = 0 keeps the query on that path. No no-* tag (per .claude/CLAUDE.md guidance) and no assertion is widened.
  • (d) Test intent preserved? Yes — the test still distinguishes "optimization fires → OverflowAny > 0" from "optimization disabled → OverflowAny == 0". Only enable_parallel_replicas is pinned; all other randomization (max_threads was already pinned to 1 for determinism, everything else stays randomized).
  • (e) Both directions demonstrated? Yes — original test: 0/5 pass under parallel-replicas injection (matches CI's 3/3 reruns diff exactly). Fixed test: 20/20 pass under the same injection.

Note on follow-up

This PR only stabilizes the test. The underlying behavior — that the analyzer pass mutates settings the remote replicas never see — is a separate question. If you'd like the pass to also skip itself when canUseParallelReplicasOnInitiator() is true (so OverflowAny reliably stays at zero under parallel replicas instead of being a no-op silently), happy to send that as a follow-up.

Session: cron:clickhouse-ci-task-worker:20260515-044500

@Algunenano Algunenano added the can be tested Allows running workflows for external contributors label May 15, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

@Algunenano Algunenano self-assigned this May 15, 2026
@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 15, 2026
@alexey-milovidov alexey-milovidov merged commit bc410e1 into ClickHouse:master May 17, 2026
166 of 167 checks passed
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants