Fix parallel replicas crash with lazy remote source#101154
Conversation
|
cc @nickitat @zlareb1 — could you review this? This fixes a P0 LOGICAL_ERROR crash in parallel replicas that was triggered by the |
|
Workflow [PR], commit [5cd1c6c] AI ReviewSummaryThis PR fixes a real correctness issue in Missing context
ClickHouse Rules
Final Verdict
|
|
@groeneai can you give list of PRs? |
|
@zlareb1 Here's the full list from CIDB. 30 crash instances across 22 distinct PRs + master in the last 30 days (all since PR #100682 enabled the Master (4 hits):
PRs (26 hits across 22 PRs):
All PRs are completely unrelated to each other and to parallel replicas — the crash is a pre-existing trunk bug exposed by the CIDB query used: SELECT pull_request_number, check_name, check_start_time, commit_sha
FROM default.checks
WHERE test_name LIKE '%STID: 5066%'
AND test_status IN ('FAIL', 'ERROR')
AND check_start_time > now() - INTERVAL 30 DAY
ORDER BY check_start_time DESC |
|
@groeneai fast tests failed |
072fdec to
eb41921
Compare
|
@pufit Fixed, thanks for flagging. Two issues:
Verified 20/20 locally. |
eb41921 to
919278f
Compare
|
@pufit @nickitat @zlareb1 — Pushed a fix for the STID 5798 regression ("Unexpected lazy remote read from a non-replicated table" server crashes across all flaky checks). Root cause: The This is the same race condition that test Fixes (commit 919278fc8c1):
Also changed changelog category to "CI Fix or Improvement" since the bug only triggers via the debug-only failpoint (bugfix validation can't reproduce against release binary). Session: cron:clickhouse-ci-task-worker:20260330-231500 |
|
@groeneai
Adding |
919278f to
975fdb9
Compare
|
@devcrafter You're right — reverted the graceful fallback. The Pushed a clean single commit (975fdb909) that now contains only:
The |
|
@groeneai Investigate this flaky test failure: Also, send a separate PR with that fix. |
|
@alexey-milovidov Investigated and fixed in a separate PR: #101437 Root cause: When CI randomizes CIDB evidence: 10 failures across 7 unrelated PRs in 30 days (amd_tsan, amd_asan_ubsan, amd_msan, amd_debug, arm_binary, amd_llvm_coverage). Fix: Pin |
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
ReadFromRemote::addLazyPipe() was missing the cluster_for_parallel_replicas override that addPipe() has. When the use_delayed_remote_source failpoint forces the lazy path, the _shard_num scalar from the distributed execution leaks into prepareClusterForParallelReplicas with a value exceeding the parallel replicas cluster's shard count, causing a LOGICAL_ERROR crash. This was triggered by PR ClickHouse#100682 enabling the use_delayed_remote_source failpoint in stress tests, causing 30+ crashes across unrelated PRs. The fix adds the same cluster_for_parallel_replicas override to addLazyPipe() that already exists in addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths. The regression test uses no-parallel tag because the use_delayed_remote_source failpoint is global — concurrent tests could trigger "Unexpected lazy remote read from a non-replicated table" crashes (same pattern as test 02863). Closes ClickHouse#81738 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
78ed03d to
56d7fd0
Compare
|
Rebased onto latest master (includes PR #100980 Azure test fixes). Single clean commit on top of |
Deserialization of complex column types (Map, Nested, Array) may add more rows than requested to a column due to SubstreamsCache interactions. When one column's substream data is cached and reused by another column, the cached data may cover more rows than `max_rows_to_read`. This caused `MergeTreeRangeReader::adjustLastGranule` to hit an unsigned underflow (`num_read_rows > total_rows_per_granule`), producing the exception: "Can't adjust last granule because it has N rows, but try to subtract 18446744073709551615 rows". The bug manifested under MSan/TSan sanitizer builds in stress tests with FINAL queries on Wide MergeTree parts containing Map subcolumns. The fix truncates excess rows from any column that grew beyond the requested limit during deserialization, maintaining the invariant that all other `IMergeTreeReader` implementations already uphold. Fixes #100769 Fixes #101336 CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101154&sha=78ed03d7d18d550a65a36d7ad07d3bab7f7dda3b&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20s3%20storage%2C%20parallel%2C%201%2F2%29 See also: #101154 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The |
LLVM Coverage ReportChanged lines: 25.00% (5/20) · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
N/A
What is the problem?
STID 5066:
ReadFromRemote::addLazyPipe()was missing thecluster_for_parallel_replicassetting override thataddPipe()has. When the lazy remote source path is used (forced by theuse_delayed_remote_sourcefailpoint enabled in stress tests by #100682), the_shard_numscalar from the distributed execution context leaks intoprepareClusterForParallelReplicas()with a value that exceeds the parallel replicas cluster's shard count.For example, shard 2 of a 2-shard distributed cluster sets
_shard_num=2, butcluster_for_parallel_replicas(typically 1-shard with N replicas) only has 1 shard — causing a LOGICAL_ERROR crash.This regression started on March 29, 2026 when #100682 enabled the
use_delayed_remote_sourcefailpoint in stress tests. 30+ crashes across 22 unrelated PRs + 4 master hits in 2 days.Related issue: #81738
How was it fixed?
Added the same
cluster_for_parallel_replicasoverride toReadFromRemote::addLazyPipe()that already exists inReadFromRemote::addPipe(), ensuring the parallel replicas cluster matches the distributed table's cluster in both code paths.The regression test uses
no-paralleltag because theuse_delayed_remote_sourcefailpoint is global — concurrent tests could trigger "Unexpected lazy remote read from a non-replicated table" crashes (same pattern as test 02863).Note: Changed from "Bug Fix" to "CI Fix" because the bug only triggers via the
use_delayed_remote_sourcefailpoint which is debug-only (disabled in release builds).How was the fix verified?
Version info
26.4.1.762