nickitat · 2026-03-31T21:29:24Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Reading in order with parallel replicas now splits the table into `max_threads` parts, using the same logic as local reading, for better parallelism.

Private test results.

Claude PR summary

  1. Protocol (src/Core/Protocol.h, ProtocolDefines.h, MergeTree/RequestResponse.{h,cpp})

  - New client packet MergeTreeAllRangesAnnouncementResponse (= 14), protocol bumped to DBMS_PARALLEL_REPLICAS_PROTOCOL_VERSION = 8 with gate
  DBMS_PARALLEL_REPLICAS_MIN_VERSION_WITH_ANNOUNCEMENT_RESPONSE.
  - New struct InitialAllRangesAnnouncementResponse { parts, stream_id }: initiator's reply to a follower's announcement, carrying the authoritative parts list for that stream.
  - MergeTreeAllRangesCallback signature changes from void(...) → std::optional<...Response>(...). nullopt = old initiator, no response sent. Engaged-but-empty parts = stream
  doesn't exist on the coordinator (over-announce) → follower's pool finishes immediately.
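
The three outcomes of the new callback result can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's code: `AnnouncementResponse`, `FollowerAction`, and `interpretResponse` are hypothetical names, and parts are reduced to plain strings instead of the real types in `RequestResponse.h`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for InitialAllRangesAnnouncementResponse { parts, stream_id }.
struct AnnouncementResponse
{
    std::vector<std::string> parts; // authoritative parts for this stream
    std::string stream_id;
};

// Hypothetical names for the three cases described above.
enum class FollowerAction
{
    LegacyNoResponse,      // nullopt: old initiator, no response was sent
    FinishImmediately,     // engaged but empty: stream unknown to the coordinator
    ReadAuthoritativeParts // engaged and non-empty: use the returned parts
};

// Maps the optional response onto the follower's next step.
FollowerAction interpretResponse(const std::optional<AnnouncementResponse> & response)
{
    if (!response)
        return FollowerAction::LegacyNoResponse;
    if (response->parts.empty())
        return FollowerAction::FinishImmediately;
    return FollowerAction::ReadAuthoritativeParts;
}
```

The point of returning `std::optional` rather than an empty struct is that "no response" and "response with no parts" stay distinguishable.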

  2. Coordinator (ParallelReplicasReadingCoordinator.{h,cpp})

  - handleInitialAllRangesAnnouncement now returns InitialAllRangesAnnouncementResponse.
  - New setSnapshotReplicaNum(replica_num): lets the initiator pin itself as the snapshot replica BEFORE any announcement arrives. Once pinned, only the snapshot replica can
  create a new stream coordinator; announcements from other replicas for unknown streams are silently dropped (empty parts returned).
  - Per-stream getRegisteredParts() virtual on ImplInterface: the post-normalization working set (InOrderCoordinator drops covered/covering parts at normalization time). Cached
  in stream_to_registered_parts on first announcement and echoed back to every subsequent announcer.
  - InOrderCoordinator::isReadingCompleted() newly implemented; pure virtual on the base.
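
The pinning and echo rules above can be modelled with a toy class. `ToyCoordinator` and `handleAnnouncement` are hypothetical and heavily simplified (parts are strings, there is no normalization); the behaviour when no replica is pinned is an assumption, not something the summary specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Parts = std::vector<std::string>;

// Toy model of the pinning rule; not the real ParallelReplicasReadingCoordinator.
class ToyCoordinator
{
public:
    void setSnapshotReplicaNum(std::size_t replica_num) { snapshot_replica = replica_num; }

    // Returns the authoritative parts for the stream, or an empty list when a
    // non-snapshot replica announces a stream the coordinator does not know.
    Parts handleAnnouncement(std::size_t replica_num, const std::string & stream_id, const Parts & announced)
    {
        auto it = stream_to_registered_parts.find(stream_id);
        if (it != stream_to_registered_parts.end())
            return it->second; // echo the cached working set to later announcers

        if (snapshot_replica && *snapshot_replica != replica_num)
            return {}; // unknown stream from a non-snapshot replica: dropped

        // First accepted announcement: register and cache it.
        // Assumption: with no pinned replica, any first announcer may create the stream.
        return stream_to_registered_parts.emplace(stream_id, announced).first->second;
    }

private:
    std::optional<std::size_t> snapshot_replica;
    std::map<std::string, Parts> stream_to_registered_parts;
};
```

In this model a follower that announces before the snapshot replica gets an empty list, and one that announces afterwards gets the snapshot replica's parts regardless of what it announced itself.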

  3. Planner (ReadFromMergeTree.{h,cpp}, ParallelReplicasLocalPlan.cpp, ClusterProxy/executeQuery.{h,cpp})

  - spreadMarkRangesAmongStreamsWithOrder refactored into three branches gated by isParallelReplicasLocalPlanForInitiator / isParallelReplicasLocalPlanForFollower:
    - Initiator + local plan: split ranges into num_streams per-stream pools (genuine splitting). Each split gets its own stream_id = <table>#split_{i}.
    - Follower + local plan: construct num_streams pools, each over ALL local parts. In response to its announcement, each pool receives the authoritative subset from the coordinator; non-owned
  parts are filtered out during source construction so no phantom consumers are built. Streams owning no parts on this follower produce empty pipes and are dropped.
    - parallel_replicas_local_plan=0: preserved old single-pool behavior.
  - make_per_split_pool_settings: divides pool_settings.threads evenly across splits to avoid min_marks_per_request = min_marks_per_task × threads being inflated num_splits-fold.
  - ParallelReplicasLocalPlan::createLocalPlanForParallelReplicas calls coordinator->setSnapshotReplicaNum(replica_number) before any announcement is sent.
  - New helper ClusterProxy::canUseLocalPlanForParallelReplicas(context): extracts the 4-clause check that gates the local-plan branch in executeQueryWithParallelReplicas
  (analyzer + parallel_replicas_local_plan + parallel_replicas_prefer_local_replica + _shard_num == 0). Now also called by ReadFromMergeTree so followers keep lock-step with the
  initiator's topology decision. This is the fix in the last two commits — without it, followers inside a Distributed sub-query took the 32-pool over-announce path while the
  initiator skipped local plan entirely, causing every split's coordinator to register the full part view and amplify reads ~`num_streams`×.
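
The arithmetic behind `make_per_split_pool_settings` can be sketched as follows. `PoolSettings`, `minMarksPerRequest`, and `makePerSplitPoolSettings` are hypothetical stand-ins, and the exact division rule (integer division with a floor of one thread) is an assumption; the summary only says threads are divided evenly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical subset of the pool settings relevant to request sizing.
struct PoolSettings
{
    std::size_t threads = 1;
    std::size_t min_marks_per_task = 1;
};

// min_marks_per_request = min_marks_per_task × threads, as stated above.
std::size_t minMarksPerRequest(const PoolSettings & settings)
{
    return settings.min_marks_per_task * settings.threads;
}

// Assumed rule: integer division, keeping at least one thread per split.
PoolSettings makePerSplitPoolSettings(PoolSettings settings, std::size_t num_splits)
{
    assert(num_splits > 0);
    settings.threads = std::max<std::size_t>(1, settings.threads / num_splits);
    return settings;
}
```

With 16 threads, 16 splits, and `min_marks_per_task = 8`, each split asks for 8 marks per request under this rule; reusing the undivided settings would make every split ask for 128.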

  4. Read pools (MergeTreeReadPoolParallelReplicas{,InOrder}.{h,cpp}, MergeTreeReadPoolBase.{h,cpp})

  - MergeTreeReadPoolBase::buildAnnouncementDescriptions(): lifted from the InOrder pool to be shared (fills in per-part min_marks_per_task). The announcement is now sent by
  ReadFromMergeTree directly (not the pool constructor), so the response can be consumed at the call site.
  - MergeTreeReadPoolParallelReplicasInOrder:
    - Steady-state task size now capped at min_marks_per_task (matching the Default pool) rather than max_block_size / index_granularity — fixes a long-standing over-small task
  issue while keeping warmup growth for early-LIMIT termination.
    - Response matching is now keyed by (part_info, projection_name) instead of by position, since coordinators can return parts in arbitrary order.
  - ParallelReadingExtension::sendInitialRequest returns the response. Split into sendReadRequest (Default) / sendReadInOrderRequest (positional → keyed).
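
Matching by key rather than by position might look like the sketch below. `PartDescription`, `indexByKey`, and `filterOwned` are hypothetical and simplified; the only detail taken from the summary is the key `(part_info, projection_name)` and the fact that the response order is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified part description.
struct PartDescription
{
    std::string part_info;
    std::string projection_name;
    std::size_t marks = 0;
};

using PartKey = std::pair<std::string, std::string>;

// Index the response by key so lookups do not depend on the order of parts.
std::map<PartKey, PartDescription> indexByKey(const std::vector<PartDescription> & response)
{
    std::map<PartKey, PartDescription> index;
    for (const auto & part : response)
        index.emplace(PartKey{part.part_info, part.projection_name}, part);
    return index;
}

// Keeps only the local parts present in the response, in local order,
// taking the description from the response.
std::vector<PartDescription> filterOwned(
    const std::vector<PartDescription> & local, const std::vector<PartDescription> & response)
{
    const auto index = indexByKey(response);
    std::vector<PartDescription> owned;
    for (const auto & part : local)
        if (auto it = index.find({part.part_info, part.projection_name}); it != index.end())
            owned.push_back(it->second);
    return owned;
}
```

A positional match would pair the first local part with the first response entry, which silently misassigns ranges as soon as the coordinator reorders or omits a part.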

  5. Connection plumbing

  - IServerConnection / IConnections / Connection / MultiplexedConnections / LocalConnection / HedgedConnections: new sendMergeTreeAllRangesAnnouncementResponse virtual + impl.
  - TCPHandler: the announcement callback now blocks for the response (when client protocol ≥ 8) via new receiveAllRangesAnnouncementResponse. Errors set stop_query = true and
  rethrow.
  - RemoteQueryExecutor::processMergeTreeInitialReadAnnouncement: relays the coordinator's response back to the announcing replica.

  6. Tests

  - New 04073_parallel_replicas_in_order_splits.sql: verifies that a single part is genuinely split into ≥ max_threads MergeTreeSelect sources for both WithOrder and
  ReverseOrder.
  - New integration test test_parallel_replicas_protocol (separate fixture under tests/integration/).
  - Modified 00177_memory_bound_merging.sh (test4): runs the in-order query with count() (which amplifies under work duplication, unlike the idempotent max(URL) in test1/test2)
  and compares against a single-node baseline, with max_threads=16 and parallel_replicas_local_plan=1 pinned so that random settings can't mask the regression. This catches the
  over-announce bug that the last two commits fix.
  - Reference updates in 02404_memory_bound_merging, 02883_parallel_replicas_join_algo_and_analyzer_{2,3}, 03222_parallel_replicas_memory_bound_merging_projection,
  03724_parallel_replicas_duplicate_requests for the new split topology.

  ---
  Suggested review order

  1. Protocol + coordinator (sections 1–2) — sets the contract.
  2. Planner (section 3) — most of the new code; the three-branch split in spreadMarkRangesAmongStreamsWithOrder plus the snapshot pinning are the heart of the change.
  2a. The last two commits in isolation (0d3f0456c06, a92a094ca72) — small, self-contained canUseLocalPlanForParallelReplicas factoring + regression test.
  3. Read pool + connection plumbing (sections 4–5).
  4. Tests (section 6) — 04073 is the cleanest functional spec of the new behavior.

Version info

Merged into: 26.7.1.426

clickhouse-gh · 2026-03-31T21:30:08Z

Workflow [PR], commit [e759052]

Summary: ❌

Performance Comparison: Performance dashboard

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| AST fuzzer (amd_debug, targeted) | | FAIL | | |
| | Logical error: Bad cast from type A to B (STID: 2682-3e1c) | FAIL | cidb | |

AI Review

Summary

This PR reworks in-order parallel-replica reads to use split streams, adds the announcement-response protocol needed to align followers with the initiator's topology, and updates the protocol spec/tests accordingly. The main correctness issues from earlier rounds look addressed, but one follower-side scalability problem is still present in the current code, and the PR's current benchmark signal is still negative on the path it claims to improve, so I would not approve it yet.

Findings

⚠️ Majors

[src/Processors/QueryPlan/ReadFromMergeTree.cpp:1568] The follower-side parallel_replicas_local_plan = 1 path still creates num_streams independent pools from full copies of all_parts_for_replicas, and each pool materializes full per-part state plus a full initial announcement before the coordinator prunes it. That keeps setup cost proportional to max_threads * parts even when most splits are later discarded, so large-part-count queries pay the blow-up up front. The current Cloud Performance Report on this PR is consistent with that cost still leaking into user-visible behavior (tpch Q4 +192.7%, Q7 +793.7%, Q8 +25.0%). Suggested fix: make follower split setup lightweight until the announcement response arrives, or otherwise avoid copying/sending the full part list once per split.

Tests

⚠️ This PR is tagged as Performance Improvement, but the current benchmark evidence on the PR still shows regressions on the touched in-order path. Please rerun focused before/after measurements after fixing the follower setup path, or explain why the reported tpch regressions are acceptable.

Final Verdict

Status: ⚠️ Request changes

Minimum required actions:

Remove or justify the num_streams * parts follower setup blow-up in the split-stream local-plan path.
Provide benchmark evidence showing that the in-order parallel-replica path is at least performance-neutral before keeping the Performance Improvement claim.

alexey-milovidov · 2026-04-07T00:28:47Z

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

`all_parts_for_replicas = parts_with_ranges` previously ran for every non-initiator, including `parallel_replicas_local_plan=0`. The legacy single-pool branch consumes the list once with `std::move` — the copy was a pure full-vector waste on the in-order parallel-replica hot path. Only the local-plan follower path actually needs the copy (each split reads from a copy and filters down to its assigned subset). For the single-pool branch, move `parts_with_ranges` directly into the only pool that consumes it: that path never reached the split builder, so `parts_with_ranges` is still intact and there's no second consumer to worry about. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The cluster was bumped from 25.12 to 26.5 in 35881bb, but the inline comment on the third (current-build) node still mentioned 25.12. Update it so the rolling-upgrade rationale stays consistent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The existing pipeline-shape checks (`num_sources >= 4`) caught the "splits aren't created" regression but would silently pass through a "split assignment drops or duplicates ranges" regression that kept the shape intact. Add `count() = 1000000, sum(a) = 499999500000` against the known data baseline for both `ORDER BY a` and `ORDER BY a DESC` so the test fails immediately under coverage holes or amplification. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
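
The two baseline numbers are consistent with a column `a` holding the integers 0 through 999999; that data shape is an assumption inferred from the values, not something the commit message states. Under it, the expected sum follows from the closed form n(n-1)/2, and the hypothetical helper `expectedSum` below checks it at compile time:

```cpp
#include <cassert>
#include <cstdint>

// Sum of 0 + 1 + ... + (n - 1): the expected sum(a) if column a holds 0..n-1.
constexpr std::uint64_t expectedSum(std::uint64_t n)
{
    return n * (n - 1) / 2;
}

// 1000000 * 999999 / 2 = 499999500000, the value asserted by the test.
static_assert(expectedSum(1000000) == 499999500000ULL, "matches the asserted baseline");
```

A dropped range lowers both `count()` and `sum(a)`, and a duplicated range raises both, so either kind of regression moves the result away from the baseline.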

Two stale mentions left over from the 25.12 → 26.5 bump:
- Module-level rationale: the parenthetical "(25.12 has PR=5 and is excluded by ...)" is rewritten to talk about the disconnect gate generically; the specific 25.12 detour shouldn't outlive its relevance to anyone reading the test fresh.
- `test_split_topology_rolling_upgrade` docstring: "The 25.12 peer" → "The 26.5 peer" to match the actual `tag=`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…packet 14

- Feature table `VERSIONED_PARALLEL_REPLICAS_PROTOCOL`: current value bumped from `7` to `8`, with a paragraph describing what version `8` adds (`MergeTreeAllRangesAnnouncementResponse` and the initiator-replies-to-followers contract).
- `ServerHello.parallel_replicas_protocol_version` and `Addendum.parallel_replicas_protocol_version`: current value bumped `7` → `8` so the canonical-value column matches `DBMS_PARALLEL_REPLICAS_PROTOCOL_VERSION`.
- Client → Server packet table: add packet `14` `MergeTreeAllRangesAnnouncementResponse`, body `not specified` to match the convention used by sibling parallel-replicas packets, with a description that explains the version gate and that it's inter-server only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add a dedicated `MergeTreeAllRangesAnnouncementResponse` section under "Message reference" describing:
- When the packet flies (`parallel_replicas_protocol_version ≥ 8` AND the originating announcement's `mode` is non-`Default`; `Default` mode stays fire-and-forget).
- The three top-level body fields (`version`, `parts`, `stream_id`) and how `version` falls back below the `DBMS_MIN_REVISION_WITH_VERSIONED_PARALLEL_REPLICAS_PROTOCOL` TCP revision.
- The `RangesInDataPartsDescription` and `RangesInDataPartDescription` wire formats with their gates (`MIN_VERSION_WITH_PROJECTION` v5, `MIN_VERSION_WITH_MIN_MARKS_PER_TASK` v6).
- The `MergeTreePartInfo` and `MarkRanges` byte layouts including the little-endian / VarUInt / boolean-text quirks.

Link the Client → Server packet-table row and the feature-table entry to the new body section so the canonical spec covers everything needed to implement or validate the v8 inter-server packet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
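
The "when the packet flies" condition can be written as a small predicate. This is a sketch under stated assumptions: `ReadingMode`, `MIN_VERSION_WITH_ANNOUNCEMENT_RESPONSE`, and `sendsAnnouncementResponse` are hypothetical names chosen for illustration, with the threshold `8` and the three modes taken from the descriptions in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Reading modes named in this PR; the enum itself is a hypothetical stand-in.
enum class ReadingMode { Default, WithOrder, ReverseOrder };

// Threshold from this PR: the announcement response first appears in version 8.
constexpr std::uint64_t MIN_VERSION_WITH_ANNOUNCEMENT_RESPONSE = 8;

// True when the initiator replies to a follower's announcement:
// the peer speaks version >= 8 and the announcement is not in Default mode.
constexpr bool sendsAnnouncementResponse(std::uint64_t peer_protocol_version, ReadingMode mode)
{
    return peer_protocol_version >= MIN_VERSION_WITH_ANNOUNCEMENT_RESPONSE
        && mode != ReadingMode::Default;
}
```

Both sides have to evaluate the same predicate: a follower that waits for a response the initiator never sends would block, which is why `Default` mode is kept fire-and-forget.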

…up packet direction

The response-flow narrative incorrectly said the follower issues `MergeTreeReadTaskResponse` (client packet `10`) after the announcement-response, which is the response side, i.e. the initiator's reply, on the wrong endpoint. The follower actually sends `MergeTreeReadTaskRequest` (server packet `16`, follower→initiator) and the initiator replies with `MergeTreeReadTaskResponse` (client packet `10`). Correct the wording and spell out both packet roles so third-party native clients don't implement the response side on the wrong endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…egisteredParts

The two comments around `getRegisteredParts` / `stream_to_registered_parts` suggested that `InOrderCoordinator` drops covered/covering parts during normalization of a single announcement, making the captured working set a subset of the announcement payload. That's wrong: within a single MergeTree replica's announcement parts are non-overlapping by construction, and the cover/covered branches in `doHandleInitialAllRangesAnnouncement` only ever deduplicate across replicas (which, in the snapshot-pinned topology this PR adds, is moot because the snapshot replica is the first announcer). Rewrite both comments to describe what's actually happening: the working set equals the first announcement one-to-one, and we capture it via `getRegisteredParts` (rather than from the announcement directly) only to keep the lookup independent of any future per-stream coordinator that may post-process its input.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

devcrafter

lgtm

devcrafter · 2026-06-29T13:28:55Z


    MergeTreeReadPoolPtr pool;

+    /// Authoritative set of (parent_info, projection_name) reported by the coordinator for this


Could we simplify the comment? AI makes it harder to perceive.

Under some CI configs the loop-insert pattern (one `INSERT` per node into a `ReplicatedMergeTree`) ended up with more than 1M rows in `ts` because replication timing left the table somewhere between 1M and 3M rows when queries started. `sum()` and `LIMIT` queries survived that (parallel replicas dedupes reads, so `sum` looks correct at 1x and `LIMIT 10` is idempotent), but the new `count()` assertion caught the extra rows and failed on amd_msan / amd_asan_ubsan / arm_binary variants with `count=20` instead of `10` per group. Insert exactly once from `split_topology_nodes[0]` and `SYSTEM SYNC REPLICA` the other two, then `OPTIMIZE FINAL` everywhere, so every replica sees the same 1M rows regardless of scheduler timing. Fixes: 4x reports on https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101434&sha=489aca864d0395f291e59e15f93d7a36e7986338&name_0=PR Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

clickhouse-gh · 2026-07-02T03:33:32Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered: 392/421 (93.11%) · Uncovered code

Full report · Diff report

…o-read-setting

The branch was ~11.7k commits behind (last merge 2026-06-20) and red. The diff vs. `master` is still the single randomizer line, with no functional change:

`"parallel_replicas_min_number_of_rows_per_replica": lambda: random.randint(0, 1),`

Reason to re-merge now: the parallel-replicas owners have landed several directly relevant fixes since 2026-06-20, so refreshing CI on today's `master` will narrow the remaining set of genuine product bugs this PR is meant to surface:
- #108451 (`Fix NOT_FOUND_COLUMN_IN_BLOCK for virtual columns under parallel replicas`, closes #106561): should clear the tracked `04098_asterisk_include_virtual_columns_mergetree` failure.
- #101434 (`Reimplement reading in order for parallel replicas`): bears directly on the `max_rows_to_read`-not-honored class (`02155_read_in_order_max_rows_to_read`, `00945_bloom_filter_index`).
- #109003 (`Fix server abort on GROUPING SETS in a set operation with parallel replicas`): a "Server died" class fix.
- Flaky-test fix for `04051_pk_analysis_stats`.

Conflicts were all in files the branch does not intentionally change (its only intended change is the one `tests/clickhouse-test` line); they were resolved by taking `master`'s version. No pinning/blacklisting of the affected parallel-replicas tests, since that would mask the very signal this PR exists to produce.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh Bot added the pr-performance label (Pull request with some performance improvements) Mar 31, 2026

clickhouse-gh Bot reviewed Mar 31, 2026

Comment thread src/Storages/MergeTree/ParallelReplicasReadingCoordinator.cpp Outdated

clickhouse-gh Bot reviewed Apr 1, 2026

Comment thread src/Storages/MergeTree/ParallelReplicasReadingCoordinator.cpp

clickhouse-gh Bot reviewed Apr 1, 2026

Comment thread src/Storages/MergeTree/ParallelReplicasReadingCoordinator.cpp Outdated

clickhouse-gh Bot reviewed Apr 1, 2026

Comment thread src/Storages/MergeTree/ParallelReplicasReadingCoordinator.cpp Outdated

nickitat added 8 commits April 1, 2026 21:27

test bw-compatibility for in-order reading too

5873898

prepare to separate requests + cleanup

96cdf19

introduce split_id in protocol

7f69e72

initial support for split_id by the coordinator

a32065d

send multiple splits in announcements

12fe60a

proper handling of is_finished

e83a5c4

move the code around

788842a

add a test

a8cae3d

nickitat force-pushed the read_in_order branch from 0aeeaa9 to a8cae3d Compare April 1, 2026 21:28

clickhouse-gh Bot reviewed Apr 1, 2026

Comment thread src/Storages/MergeTree/MergeTreeReadPoolParallelReplicasInOrder.cpp Outdated

fix tests

f7afb6b

clickhouse-gh Bot reviewed Apr 2, 2026

Comment thread src/Storages/MergeTree/ParallelReplicasReadingCoordinator.cpp Outdated

Merge branch 'master' into read_in_order

cab5416

clickhouse-gh Bot reviewed Apr 7, 2026

Comment thread src/Storages/MergeTree/ParallelReplicasReadingCoordinator.cpp Outdated

alexey-milovidov mentioned this pull request Apr 10, 2026

Fix flaky test 03993_map_subcolumns_small_compact #102350

Merged

1 task

fix test

2f62dfb

clickhouse-gh Bot reviewed Apr 14, 2026

Comment thread src/Storages/MergeTree/RequestResponse.h Outdated

nickitat added 2 commits April 14, 2026 20:01

better

b5ddbec

fix

f1d1107

clickhouse-gh Bot reviewed Apr 14, 2026

Comment thread src/Processors/QueryPlan/ReadFromMergeTree.cpp Outdated

better

134a739

clickhouse-gh Bot reviewed Apr 14, 2026

Comment thread src/Storages/MergeTree/ParallelReplicasReadingCoordinator.cpp Outdated

better

b38e76b

clickhouse-gh Bot reviewed Jun 18, 2026

Comment thread tests/integration/test_backward_compatibility/test_parallel_replicas_protocol.py Outdated

nickitat and others added 3 commits June 22, 2026 17:08

clickhouse-gh Bot reviewed Jun 22, 2026

Comment thread src/Core/ProtocolDefines.h

nickitat and others added 4 commits June 22, 2026 18:04

Merge remote-tracking branch 'origin/master' into read_in_order

3fb2d82

clickhouse-gh Bot reviewed Jun 22, 2026

Comment thread docs/en/interfaces/specs/NativeProtocol.md Outdated

nickitat and others added 4 commits June 22, 2026 18:42

fix flaky test

4434ced

Merge branch 'master' into read_in_order

6227fcf

devcrafter approved these changes Jun 29, 2026

alexey-milovidov mentioned this pull request Jul 1, 2026

Fix "Not-ready Set" exception when buildOrderedSetInplace fails #102192

Draft

1 task

fix comment

489aca8

alexey-milovidov mentioned this pull request Jul 1, 2026

Fix hang in SYSTEM SYNC MERGES with the Manual merge selector #108770

Open

nickitat enabled auto-merge July 1, 2026 18:23

nickitat added this pull request to the merge queue Jul 2, 2026

Merged via the queue into master with commit 7ad18bf Jul 2, 2026
513 of 520 checks passed

nickitat deleted the read_in_order branch July 2, 2026 11:56

robot-ch-test-poll added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Jul 2, 2026

alexey-milovidov mentioned this pull request Jul 2, 2026

Exclude inactive replicas when sizing parallel replicas #107805

Merged

groeneai mentioned this pull request Jul 2, 2026

Fix LOGICAL_ERROR for unknown stream in parallel replicas coordinator #109202

Closed

alexey-milovidov mentioned this pull request Jul 3, 2026

Enable read_in_order_use_virtual_row by default #106215

Open

1 task

