{{ message }}
Reimplement reading in order for parallel replicas#101434
Merged
Merged
Conversation
Contributor
Member
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
1 task
`all_parts_for_replicas = parts_with_ranges` previously ran for every non-initiator, including `parallel_replicas_local_plan=0`. The legacy single-pool branch consumes the list once with `std::move` — the copy was a pure full-vector waste on the in-order parallel-replica hot path. Only the local-plan follower path actually needs the copy (each split reads from a copy and filters down to its assigned subset). For the single-pool branch, move `parts_with_ranges` directly into the only pool that consumes it: that path never reached the split builder, so `parts_with_ranges` is still intact and there's no second consumer to worry about. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cluster was bumped from 25.12 to 26.5 in 35881bb, but the inline comment on the third (current-build) node still mentioned 25.12. Update it so the rolling-upgrade rationale stays consistent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The existing pipeline-shape checks (`num_sources >= 4`) caught the "splits aren't created" regression but would silently pass through a "split assignment drops or duplicates ranges" regression that kept the shape intact. Add `count() = 1000000, sum(a) = 499999500000` against the known data baseline for both `ORDER BY a` and `ORDER BY a DESC` so the test fails immediately under coverage holes or amplification. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two stale mentions left over from the 25.12 → 26.5 bump:
- Module-level rationale: the parenthetical "(25.12 has PR=5 and is
excluded by ...)" is rewritten to talk about the disconnect gate
generically — the specific 25.12 detour shouldn't outlive its
relevance to anyone reading the test fresh.
- `test_split_topology_rolling_upgrade` docstring: "The 25.12 peer"
→ "The 26.5 peer" to match the actual `tag=`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…packet 14 - Feature table `VERSIONED_PARALLEL_REPLICAS_PROTOCOL`: current value bumped from `7` to `8`, with a paragraph describing what version `8` adds (`MergeTreeAllRangesAnnouncementResponse` and the initiator-replies-to-followers contract). - `ServerHello.parallel_replicas_protocol_version` and `Addendum.parallel_replicas_protocol_version`: current value bumped `7` → `8` so the canonical-value column matches `DBMS_PARALLEL_REPLICAS_PROTOCOL_VERSION`. - Client → Server packet table: add packet `14` `MergeTreeAllRangesAnnouncementResponse`, body `not specified` to match the convention used by sibling parallel-replicas packets, with a description that explains the version gate and that it's inter-server only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a dedicated `MergeTreeAllRangesAnnouncementResponse` section under
"Message reference" describing:
- When the packet flies (`parallel_replicas_protocol_version ≥ 8` AND
the originating announcement's `mode` is non-`Default`; `Default`
mode stays fire-and-forget).
- The three top-level body fields (`version`, `parts`, `stream_id`)
and how `version` falls back below the
`DBMS_MIN_REVISION_WITH_VERSIONED_PARALLEL_REPLICAS_PROTOCOL` TCP
revision.
- The `RangesInDataPartsDescription` and `RangesInDataPartDescription`
wire formats with their gates (`MIN_VERSION_WITH_PROJECTION` v5,
`MIN_VERSION_WITH_MIN_MARKS_PER_TASK` v6).
- The `MergeTreePartInfo` and `MarkRanges` byte layouts including the
little-endian / VarUInt / boolean-text quirks.
Link the Client → Server packet-table row and the feature-table entry to
the new body section so the canonical spec covers everything needed to
implement or validate the v8 inter-server packet.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…up packet direction The response-flow narrative incorrectly said the follower issues `MergeTreeReadTaskResponse` (client packet `10`) after the announcement-response, which is the response side — i.e. the initiator's reply, on the wrong endpoint. The follower actually sends `MergeTreeReadTaskRequest` (server packet `16`, follower→initiator) and the initiator replies with `MergeTreeReadTaskResponse` (client packet `10`). Correct the wording and spell out both packet roles so third-party native clients don't implement the response side on the wrong endpoint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…egisteredParts The two comments around `getRegisteredParts` / `stream_to_registered_parts` suggested that `InOrderCoordinator` drops covered/covering parts during normalization of a single announcement, making the captured working set a subset of the announcement payload. That's wrong: within a single MergeTree replica's announcement parts are non-overlapping by construction, and the cover/covered branches in `doHandleInitialAllRangesAnnouncement` only ever deduplicate across replicas (which, in the snapshot-pinned topology this PR adds, is moot because the snapshot replica is the first announcer). Rewrite both comments to describe what's actually happening: the working set equals the first announcement one-to-one, and we capture it via `getRegisteredParts` (rather than from the announcement directly) only to keep the lookup independent of any future per-stream coordinator that may post-process its input. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
devcrafter
approved these changes
Jun 29, 2026
|
|
||
| MergeTreeReadPoolPtr pool; | ||
|
|
||
| /// Authoritative set of (parent_info, projection_name) reported by the coordinator for this |
Member
There was a problem hiding this comment.
Could we simplify the comment? AI makes it harder to perceive.
1 task
Under some CI configs the loop-insert pattern (one `INSERT` per node into a `ReplicatedMergeTree`) ended up with more than 1M rows in `ts` because replication timing left the table somewhere between 1M and 3M rows when queries started. `sum()` and `LIMIT` queries survived that (parallel replicas dedupes reads, so `sum` looks correct at 1x and `LIMIT 10` is idempotent), but the new `count()` assertion caught the extra rows and failed on amd_msan / amd_asan_ubsan / arm_binary variants with `count=20` instead of `10` per group. Insert exactly once from `split_topology_nodes[0]` and `SYSTEM SYNC REPLICA` the other two, then `OPTIMIZE FINAL` everywhere, so every replica sees the same 1M rows regardless of scheduler timing. Fixes: 4x reports on https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101434&sha=489aca864d0395f291e59e15f93d7a36e7986338&name_0=PR Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 392/421 (93.11%) · Uncovered code |
alexey-milovidov
added a commit
that referenced
this pull request
Jul 2, 2026
…o-read-setting
The branch was ~11.7k commits behind (last merge 2026-06-20) and red. The
diff vs. `master` is still the single randomizer line — no functional change:
"parallel_replicas_min_number_of_rows_per_replica": lambda: random.randint(0, 1),
Reason to re-merge now: the parallel-replicas owners have landed several
directly relevant fixes since 2026-06-20, so refreshing CI on today's
`master` will narrow the remaining set of genuine product bugs this PR is
meant to surface:
- #108451 (`Fix NOT_FOUND_COLUMN_IN_BLOCK for virtual columns under parallel
replicas`, closes #106561) — should clear the tracked
`04098_asterisk_include_virtual_columns_mergetree` failure.
- #101434 (`Reimplement reading in order for parallel replicas`) — bears
directly on the `max_rows_to_read`-not-honored class
(`02155_read_in_order_max_rows_to_read`, `00945_bloom_filter_index`).
- #109003 (`Fix server abort on GROUPING SETS in a set operation with
parallel replicas`) — a "Server died" class fix.
- Flaky-test fix for `04051_pk_analysis_stats`.
Conflicts were all in files the branch does not intentionally change (its
only intended change is the one `tests/clickhouse-test` line); they were
resolved by taking `master`'s version. No pinning/blacklisting of the
affected parallel-replicas tests — that would mask the very signal this PR
exists to produce.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Reading in-order with Parallel Replicas now uses the same logic of splitting the table into
max_threadsparts as the local reading for better parallelism.Private test results.
Claude PR summary
Version info
26.7.1.426