Randomize parallel_replicas_min_number_of_rows_per_replica#71028
Randomize parallel_replicas_min_number_of_rows_per_replica#71028devcrafter wants to merge 19 commits into
Conversation
|
Dear @antaljanosbenjamin, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
|
Workflow [PR], commit [62da6d9] Summary: ❌
AI ReviewSummaryThis PR still makes the stateless randomized test harness globally generate Missing context / blind spots
Findings❌ Blockers
Final Verdict
|
|
@devcrafter, it shows errors. |
|
Dear @alexey-milovidov, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
|
@devcrafter, it failed. |
|
Dear @alexey-milovidov, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
|
@devcrafter, more changes are needed. |
|
Dear @alexey-milovidov, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
…o-read-setting # Conflicts: # tests/clickhouse-test
|
Merged current "parallel_replicas_min_number_of_rows_per_replica": lambda: random.randint(0, 1),Pushed Note on the prior (2026-03-28) CI failures: I spot-checked two of them against a fresh
The randomizer line behaves as intended: a value of |
|
Merged The remaining CI red is not caused by the one-line diff itself — it is the parallel-replicas bugs this PR is meant to surface. The CI randomized-settings diagnosis points at A. Genuine server bugs (block merge, outside the scope of this test-only change):
B. Single-node "measurement" tests that just need parallel replicas pinned off (contained test fixes):
The (B) tests can be made robust by pinning |
|
Re-merged current I re-triaged the latest CI red and it matches the prior categorization; nothing new is regressed by this PR, the failures are exactly the latent parallel-replicas bugs that Genuine server bugs (block merge, owned by the parallel-replicas team, out of scope for this test-only change):
Index-analysis / "measurement" tests — divergence is itself likely a real behavioral gap, so NOT safe to pin:
Deliberately did not tag/blacklist or pin parallel replicas off in any of these — pinning the index-analysis tests would mask exactly the kind of gap this PR is designed to expose. None of the remaining red is fixable inside this one-line test change; it needs the underlying parallel-replicas bugs fixed first. Flagging for your sequencing call (fix-then-randomize vs. land-and-track) — the failures are product bugs for the parallel-replicas owners rather than something to mask here. |
| "max_parsing_threads": lambda: random.choice([0, 1, 10]), | ||
| "optimize_functions_to_subcolumns": lambda: random.randint(0, 1), | ||
| "parallel_replicas_local_plan": lambda: random.randint(0, 1), | ||
| "parallel_replicas_min_number_of_rows_per_replica": lambda: random.randint(0, 1), |
There was a problem hiding this comment.
Randomizing this globally violates the randomized-test invariant that generated settings must still keep the stateless suite green. With automatic_parallel_replicas_mode = 2, adjust_settings_for_autopr forces enable_parallel_replicas, cluster_for_parallel_replicas, and parallel_replicas_local_plan; when this new setting is 1, the analyzer still builds the real parallel-replicas plan because buildQueryPlanForAutomaticParallelReplicas keeps this setting and only clears automatic_parallel_replicas_mode inside the candidate plan. The PR discussion already lists failures from that combination, including wrong results and the considerEnablingParallelReplicas assertion. Please keep this at 0 until those parallel-replica bugs are fixed, or gate the randomization behind a dedicated opt-in or affected-test exclusions so normal randomized CI stays green.
There was a problem hiding this comment.
Acknowledged — this is exactly the open sequencing question discussed in the PR-level comments (fix-then-randomize vs. land-and-track). The red CI here is deliberate: the failures are the latent parallel-replicas bugs this randomization is designed to surface (see the triage comments for the categorized list, e.g. lost WITH ROLLUP totals, wrong semi/anti/any join results, the considerEnablingParallelReplicas.cpp:359 assertion). Gating or excluding the affected tests would mask the very signal the PR exists to produce, so the deliberate choice so far has been not to do that. The PR will not be merged while these product bugs are unfixed — it stays open as the tracking/bug-finding vehicle until the parallel-replicas owners fix them, or until we decide to gate it.
|
@groeneai, investigate the failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=71028&sha=849e3ff013f70628c8daad3a4fda355753bc580b&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29 and provide a fix in a separate PR. If the fix is already in progress, link it here. The failure is |
|
Re-triaged the latest CI red (commit 849e3ff, run of 2026-06-09). No re-merge of The failures match the prior categorization, with three new members of the same classes (all reproduce only with
Independent confirmation that these are latent product bugs rather than an artifact of this diff: #106734 ("More settings to randomize") hits the exact same set of tests ( Still present from the previous rounds: One failure is unrelated to this PR — the Status unchanged: the remaining red is the latent parallel-replicas bugs this PR is designed to surface, and the sequencing call (fix-then-randomize vs. land-and-track) is still open. |
|
Investigated. This Root cause: the stateless test It started exactly on 2026-06-09 because #103544 made the server throw Reproduced locally: feeding the comment + |
|
Re-merged current The reason to re-merge now (vs. the 2026-06-11/12 triage, where re-merging would have just re-rolled the same red): the parallel-replicas owners have started landing fixes for the bugs this randomizer surfaces. Two directly relevant commits authored by @devcrafter are now on
Re-running CI on today's |
|
Re-merged current The reason to re-merge now (vs. the 2026-06-14..18 triage, where re-merging would have just re-rolled the same red): another directly relevant parallel-replicas JOIN fix has landed on
Re-running CI on today's The server-abort path ( The PR remains intentionally open as the bug-finding / tracking vehicle until the remaining parallel-replicas bugs are fixed or a sequencing decision is made — no pinning/blacklisting. |
…o-read-setting
The branch was ~11.7k commits behind (last merge 2026-06-20) and red. The
diff vs. `master` is still the single randomizer line — no functional change:
"parallel_replicas_min_number_of_rows_per_replica": lambda: random.randint(0, 1),
Reason to re-merge now: the parallel-replicas owners have landed several
directly relevant fixes since 2026-06-20, so refreshing CI on today's
`master` will narrow the remaining set of genuine product bugs this PR is
meant to surface:
- #108451 (`Fix NOT_FOUND_COLUMN_IN_BLOCK for virtual columns under parallel
replicas`, closes #106561) — should clear the tracked
`04098_asterisk_include_virtual_columns_mergetree` failure.
- #101434 (`Reimplement reading in order for parallel replicas`) — bears
directly on the `max_rows_to_read`-not-honored class
(`02155_read_in_order_max_rows_to_read`, `00945_bloom_filter_index`).
- #109003 (`Fix server abort on GROUPING SETS in a set operation with
parallel replicas`) — a "Server died" class fix.
- Flaky-test fix for `04051_pk_analysis_stats`.
Conflicts were all in files the branch does not intentionally change (its
only intended change is the one `tests/clickhouse-test` line); they were
resolved by taking `master`'s version. No pinning/blacklisting of the
affected parallel-replicas tests — that would mask the very signal this PR
exists to produce.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Re-merged current Reason to re-merge now (vs. just re-rolling the same red): the parallel-replicas owners have landed several directly relevant fixes since 2026-06-20, so refreshing CI on today's
Expected to remain red (no relevant change on
As before, I did not tag/blacklist or pin parallel replicas off in any test — pinning would mask exactly the signal this PR is designed to expose. The PR remains intentionally open as the bug-finding / tracking vehicle until the remaining parallel-replicas bugs are fixed or a sequencing decision is made. |
…rows-to-read-setting

Changelog category (leave one):
Details
The setting enables code execution, which can trigger hidden bugs, in particular in GLOBAL JOINs with parallel replicas. Discovered one while doing #70658 within
02967_parallel_replicas_joins_and_analyzertest