Fix parallel replicas crash with lazy remote source by groeneai · Pull Request #101154 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix parallel replicas crash with lazy remote source#101154

Merged
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix/parallel-replicas-lazy-shard-num-crash
Apr 10, 2026
Merged

Fix parallel replicas crash with lazy remote source#101154
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix/parallel-replicas-lazy-shard-num-crash

Conversation

@groeneai

@groeneai groeneai commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

N/A

What is the problem?

STID 5066: ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas setting override that addPipe() has. When the lazy remote source path is used (forced by the use_delayed_remote_source failpoint enabled in stress tests by #100682), the _shard_num scalar from the distributed execution context leaks into prepareClusterForParallelReplicas() with a value that exceeds the parallel replicas cluster's shard count.

For example, shard 2 of a 2-shard distributed cluster sets _shard_num=2, but cluster_for_parallel_replicas (typically 1-shard with N replicas) only has 1 shard — causing a LOGICAL_ERROR crash.

This regression started on March 29, 2026 when #100682 enabled the use_delayed_remote_source failpoint in stress tests. 30+ crashes across 22 unrelated PRs + 4 master hits in 2 days.

Related issue: #81738

How was it fixed?

Added the same cluster_for_parallel_replicas override to ReadFromRemote::addLazyPipe() that already exists in ReadFromRemote::addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths.

The regression test uses no-parallel tag because the use_delayed_remote_source failpoint is global — concurrent tests could trigger "Unexpected lazy remote read from a non-replicated table" crashes (same pattern as test 02863).

Note: Changed from "Bug Fix" to "CI Fix" because the bug only triggers via the use_delayed_remote_source failpoint which is debug-only (disabled in release builds).

How was the fix verified?

  1. Reproduction: Enabling failpoint + distributed table over 2-shard cluster + parallel replicas → crashes before fix, passes after
  2. Regression test: 10/10 passes locally with randomization
  3. CI on previous iteration: 198,911 test results with zero failures on commit 919278fc

Version info

  • Merged into: 26.4.1.762

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @nickitat @zlareb1 — could you review this? This fixes a P0 LOGICAL_ERROR crash in parallel replicas that was triggered by the use_delayed_remote_source failpoint enabled in stress tests by #100682. The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(). 14+ crashes across 13 unrelated PRs in the last 2 days.

@clickhouse-gh

clickhouse-gh Bot commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [5cd1c6c]


AI Review

Summary

This PR fixes a real correctness issue in ReadFromRemote: addLazyPipe now applies the same cluster_for_parallel_replicas override as addPipe, preventing _shard_num/cluster mismatch exceptions when the lazy remote source path is forced by failpoint. The change is surgical, aligns both code paths, and includes a targeted stateless regression test that reproduces the prior exception scenario. I did not find blocker or major issues in the patch.

Missing context
  • ⚠️ CI is still in progress, so final validation status for all jobs/logs is not yet available.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
No large/binary files
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 30, 2026
@zlareb1

zlareb1 commented Mar 30, 2026

Copy link
Copy Markdown
Member

@groeneai can you give list of PRs?

@groeneai

Copy link
Copy Markdown
Contributor Author

@zlareb1 Here's the full list from CIDB. 30 crash instances across 22 distinct PRs + master in the last 30 days (all since PR #100682 enabled the use_delayed_remote_source failpoint on March 29):

Master (4 hits):

Time (UTC) Check Commit
2026-03-30 04:44 Stress test (arm_ubsan) d51252c
2026-03-30 04:36 Stress test (amd_debug) d51252c
2026-03-30 02:13 Stress test (arm_asan_ubsan) 696456d
2026-03-30 02:13 Stress test (arm_asan_ubsan, s3) 696456d

PRs (26 hits across 22 PRs):

PR Check(s) Latest hit
#101134 amd_msan Mar 29 23:52
#101126 amd_debug Mar 30 05:00
#101111 arm_tsan, amd_msan Mar 30 03:25
#101110 amd_tsan Mar 29 21:46
#101073 amd_debug Mar 29 21:31
#101069 amd_debug Mar 30 00:54
#101066 amd_msan Mar 29 18:26
#101051 amd_msan Mar 29 22:14
#101048 amd_debug Mar 30 04:20
#101038 arm_asan_ubsan (s3), amd_msan Mar 29 18:06
#101027 amd_msan Mar 30 06:31
#101023 amd_tsan Mar 29 19:58
#101006 amd_msan Mar 30 00:49
#100975 arm_ubsan Mar 30 03:13
#100972 arm_tsan, amd_msan Mar 30 04:42
#100941 arm_release, arm_debug Mar 30 05:24
#100606 amd_debug Mar 30 03:45
#100604 amd_debug Mar 30 05:33
#100376 amd_tsan Mar 30 05:20
#99721 amd_msan Mar 30 03:10
#99513 amd_debug Mar 29 22:26
#96802 amd_msan Mar 29 19:01

All PRs are completely unrelated to each other and to parallel replicas — the crash is a pre-existing trunk bug exposed by the use_delayed_remote_source failpoint from #100682.

CIDB query used:

SELECT pull_request_number, check_name, check_start_time, commit_sha
FROM default.checks
WHERE test_name LIKE '%STID: 5066%'
  AND test_status IN ('FAIL', 'ERROR')
  AND check_start_time > now() - INTERVAL 30 DAY
ORDER BY check_start_time DESC

@nickitat nickitat added the can be tested Allows running workflows for external contributors label Mar 30, 2026
@devcrafter devcrafter self-assigned this Mar 30, 2026
@pufit

pufit commented Mar 30, 2026

Copy link
Copy Markdown
Member

@groeneai fast tests failed

@groeneai groeneai force-pushed the fix/parallel-replicas-lazy-shard-num-crash branch from 072fdec to eb41921 Compare March 30, 2026 16:21
@groeneai

Copy link
Copy Markdown
Contributor Author

@pufit Fixed, thanks for flagging. Two issues:

  1. Cluster: test_cluster_two_shard_three_replicas_localhost uses 127.0.0.2127.0.0.6 which aren't reachable in Fast test. Switched to test_cluster_two_shards_localhost (both shards on 127.0.0.1). This still exercises the bug — the crash is triggered by _shard_num = 2 from the distributed execution, regardless of which addresses the shards use.

  2. Output: The exact sum() result depends on whether Keeper is available for parallel replicas coordination. Without Keeper, parallel replicas falls back to normal distributed execution (2 shards × same local table = double sum). Changed to FORMAT Null + SELECT 'OK' since the test's purpose is verifying no crash, not checking exact values.

Verified 20/20 locally.

@groeneai groeneai force-pushed the fix/parallel-replicas-lazy-shard-num-crash branch from eb41921 to 919278f Compare March 30, 2026 23:34
@clickhouse-gh clickhouse-gh Bot added the pr-ci label Mar 30, 2026
@groeneai

Copy link
Copy Markdown
Contributor Author

@pufit @nickitat @zlareb1 — Pushed a fix for the STID 5798 regression ("Unexpected lazy remote read from a non-replicated table" server crashes across all flaky checks).

Root cause: The use_delayed_remote_source failpoint is global — when test 04069 enables it, ALL concurrent distributed queries force lazy=true in SelectStreamFactory (including queries on non-replicated MergeTree tables). If the failpoint is disabled before the lazy lambda executes, the lambda hits the !use_delayed_remote_source branch and crashes because the storage isn't StorageReplicatedMergeTree.

This is the same race condition that test 02863_delayed_source_with_totals_and_extremes already handles with -- Tags: no-parallel.

Fixes (commit 919278fc8c1):

  1. Added no-parallel tag to test 04069 (prevents failpoint from affecting concurrent tests)
  2. Replaced the LOGICAL_ERROR assertion with a graceful fallback to local plan for non-replicated tables in the lazy path — since delay-based replica selection doesn't apply to non-replicated tables, createLocalPlan() is the correct behavior (matches what SelectStreamFactory::createForShardImpl() does at line 247)

Also changed changelog category to "CI Fix or Improvement" since the bug only triggers via the debug-only failpoint (bugfix validation can't reproduce against release binary).

Session: cron:clickhouse-ci-task-worker:20260330-231500

@clickhouse-gh clickhouse-gh Bot removed the pr-bugfix Pull request with bugfix, not backported by default label Mar 30, 2026
@devcrafter

devcrafter commented Mar 31, 2026

Copy link
Copy Markdown
Member

@groeneai
We shouldn't do this :

Replaced the LOGICAL_ERROR assertion with a graceful fallback to local plan for non-replicated tables in the lazy path — since delay-based replica selection doesn't apply to non-replicated tables, createLocalPlan() is the correct behavior (matches what SelectStreamFactory::createForShardImpl() does at line 247)

Adding no-parallel tag to test should be enough

@groeneai groeneai force-pushed the fix/parallel-replicas-lazy-shard-num-crash branch from 919278f to 975fdb9 Compare March 31, 2026 12:20
@groeneai

Copy link
Copy Markdown
Contributor Author

@devcrafter You're right — reverted the graceful fallback. The no-parallel tag is sufficient to prevent the race.

Pushed a clean single commit (975fdb909) that now contains only:

  1. The cluster_for_parallel_replicas override in addLazyPipe() (the actual STID 5066 fix)
  2. Regression test with no-parallel tag

The createLocalPlan fallback for non-replicated tables is removed — the original LOGICAL_ERROR assertion stays.

@alexey-milovidov

Copy link
Copy Markdown
Member

@groeneai Investigate this flaky test failure: 02377_extend_protocol_with_query_parameters and fix it as well:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101154&sha=975fdb909a96ac079fbc750c2a4b078e3dda6a9c&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29

Also, send a separate PR with that fix.

@devcrafter devcrafter left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now

@groeneai

Copy link
Copy Markdown
Contributor Author

@alexey-milovidov Investigated and fixed in a separate PR: #101437

Root cause: When CI randomizes map_serialization_version to with_buckets, Map keys are distributed into hash buckets which changes iteration order on readback. The test inserts into a MergeTree table with Map columns and expects a specific key order — but bucketed serialization doesn't preserve insertion order.

CIDB evidence: 10 failures across 7 unrelated PRs in 30 days (amd_tsan, amd_asan_ubsan, amd_msan, amd_debug, arm_binary, amd_llvm_coverage).

Fix: Pin map_serialization_version='basic' in the test's CREATE TABLE SETTINGS. Targeted — the test validates query parameter passing, not Map serialization.

@antaljanosbenjamin antaljanosbenjamin mentioned this pull request Apr 1, 2026
1 task
@devcrafter devcrafter enabled auto-merge April 1, 2026 14:03
@azat azat disabled auto-merge April 2, 2026 11:08
@alexey-milovidov

Copy link
Copy Markdown
Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

@alexey-milovidov

alexey-milovidov commented Apr 9, 2026

Copy link
Copy Markdown
Member

@groeneai, Tests for Azure were fixed in #100980
Please update the branch to include the changes.

ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas
override that addPipe() has. When the use_delayed_remote_source failpoint
forces the lazy path, the _shard_num scalar from the distributed execution
leaks into prepareClusterForParallelReplicas with a value exceeding the
parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash.

This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source
failpoint in stress tests, causing 30+ crashes across unrelated PRs.

The fix adds the same cluster_for_parallel_replicas override to addLazyPipe()
that already exists in addPipe(), ensuring the parallel replicas cluster
matches the distributed table's cluster in both code paths.

The regression test uses no-parallel tag because the use_delayed_remote_source
failpoint is global — concurrent tests could trigger "Unexpected lazy remote
read from a non-replicated table" crashes (same pattern as test 02863).

Closes ClickHouse#81738

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@groeneai groeneai force-pushed the fix/parallel-replicas-lazy-shard-num-crash branch from 78ed03d to 56d7fd0 Compare April 9, 2026 04:48
@groeneai

groeneai commented Apr 9, 2026

Copy link
Copy Markdown
Contributor Author

Rebased onto latest master (includes PR #100980 Azure test fixes). Single clean commit on top of 4d79f6ea35f.

alexey-milovidov added a commit that referenced this pull request Apr 9, 2026
Deserialization of complex column types (Map, Nested, Array) may add
more rows than requested to a column due to SubstreamsCache interactions.
When one column's substream data is cached and reused by another column,
the cached data may cover more rows than `max_rows_to_read`.

This caused `MergeTreeRangeReader::adjustLastGranule` to hit an unsigned
underflow (`num_read_rows > total_rows_per_granule`), producing the
exception: "Can't adjust last granule because it has N rows, but try to
subtract 18446744073709551615 rows". The bug manifested under MSan/TSan
sanitizer builds in stress tests with FINAL queries on Wide MergeTree
parts containing Map subcolumns.

The fix truncates excess rows from any column that grew beyond the
requested limit during deserialization, maintaining the invariant that
all other `IMergeTreeReader` implementations already uphold.

Fixes #100769
Fixes #101336

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101154&sha=78ed03d7d18d550a65a36d7ad07d3bab7f7dda3b&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20s3%20storage%2C%20parallel%2C%201%2F2%29
See also: #101154

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

The Can't adjust last granule error in CI is a known issue. The fix is in #101641

@clickhouse-gh

clickhouse-gh Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.00% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 25.00% (5/20) · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very reasonable.

@alexey-milovidov alexey-milovidov merged commit f0a26a7 into ClickHouse:master Apr 10, 2026
158 of 161 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants