Fix false-positive chassert in paranoidCheckForCoveredPartsInZooKeeperOnStart#102864
Fix false-positive chassert in paranoidCheckForCoveredPartsInZooKeeperOnStart#102864groeneai wants to merge 2 commits into
Conversation
…rOnStart The paranoid check on startup verifies that all ZK parts covered by another ZK part either exist on disk or are in the parts_to_fetch list. However, it does not check whether a local active part already covers the ZK part. This legitimately happens after a merge or mutation: the covering part is active locally, the covered part was cleaned from disk by clearOldPartsAndRemoveFromZK, but the covered part's ZK entry has not been cleaned up yet (ZK cleanup can lag behind disk cleanup, or the server was restarted between the two). The data is preserved in the local covering part, and the stale ZK entry will be removed by the cleanup thread after startup. Without this fix, the chassert(false) fires as a false positive in debug/sanitizer builds during stress tests, particularly in the createLogEntriesToFetchBrokenParts() call path (STID: 2508-5dc3, 2508-6644). Prior fixes PR ClickHouse#96164, ClickHouse#96705, and ClickHouse#100652 addressed other false-positive scenarios in this function but missed the case where a local active part covers the ZK part. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
cc @tavplubix @alesapin — could you review this? It fixes a false-positive chassert(false) in paranoidCheckForCoveredPartsInZooKeeperOnStart that crashes debug/sanitizer builds during stress tests (14+ hits in 30 days, STID 2508-5dc3 and 2508-6644). The fix adds a getActiveContainingPart() check so the assertion doesn't fire when a local active part already covers the stale ZK entry — a legitimate post-merge state that prior fixes (#96164, #96705, #100652) didn't handle. |
|
Workflow [PR], commit [221f150] Summary: ❌
AI ReviewSummaryThis PR fixes a debug/sanitizer-only false-positive assertion in ClickHouse Rules
Final Verdict
|
@groeneai, this is not true. Since the local part is not in zk, this part is going to be removed as unexpected |
|
@groeneai, continue on this PR according to the reviewer feedback. |
Per @tavplubix's feedback, replace the `getActiveContainingPart()` check with a direct disk-existence check on `covering_part`. The covering part is a ZooKeeper part by construction (it comes from `active_set` built from `parts_in_zk`), so this addresses his concern that a local-only part is treated as `unexpected` at load time and detached as `ignored` later in `checkPartsImpl`. Why the previous check was nonetheless safe: `getActiveContainingPart` returns from `data_parts_by_state_and_info`, which only contains parts loaded as `expected` (in ZooKeeper at load time, per `StorageReplicatedMergeTree.cpp:507-549,561` populating `expected_parts_on_this_replica` and `MergeTreeData.cpp:2330-2333` splitting parts on disk into expected vs unexpected at load). Parts on disk not in ZK at load go to a separate `unexpected_data_parts` vector and are detached as `ignored` (`StorageReplicatedMergeTree.cpp:2047`); they never enter `data_parts_by_state_and_info` and thus are not returned by `getActiveContainingPart`. So the original fix could not be fooled by an unexpected part. The new check is nonetheless cleaner and more direct: if the ZK covering part is on disk, the covered part's data is provably preserved (covering parts contain all data of the parts they cover by definition). ClickHouse#102864
|
@tavplubix thanks for the pushback — you're right that a local part not in ZooKeeper would be detached as Why the original
|
| Part | In ZK? | On disk? |
|---|---|---|
covering_part (e.g. all_0_2_1) |
yes | yes (active local) |
part_name (e.g. all_0_0_0) |
yes — stale entry, cleanup thread hasn't run | no — cleaned |
part_name (e.g. all_1_1_0) |
yes — stale entry | no — cleaned |
Old logic: part_name not on disk, not in parts_to_fetch (because getActiveContainingPart(part_name) returns the local copy of covering_part, so it's excluded from the fetch list at the call site) → chassert(false) fires.
New logic: same as above, then we additionally check whether covering_part exists on disk. It does, because the merge result is still there — so we take the legitimate-state branch, log a WARNING, and skip the chassert.
True data loss path is unchanged: if the covering part is genuinely missing on disk, found stays false and the chassert fires.
Pre-PR validation gate (session: cron:clickhouse-ci-task-worker:20260425-161500)
a) Deterministic repro? Same startup race as before (14+ CI hits in 30 days, STIDs 2508-5dc3 / 2508-6644). No new local repro for this iteration; the change is a strict tightening of the guard.
b) Root cause explained? Yes — the chassert did not consider the case where the ZK covering part exists on disk. parts_to_fetch cannot reach it because the call site excludes covered parts whose covering part is locally present (see ReplicatedMergeTreeQueue.cpp:243-246 and StorageReplicatedMergeTree.cpp:1888-1891).
c) Fix matches root cause? Yes — directly checks the missing case using the ZK covering part name.
d) Test intent preserved? Yes — the assertion still fires for genuine data loss (no part_name on disk, no covering_part on disk, not in parts_to_fetch).
e) Both directions demonstrated? Without the fix: 14+ CI master hits over 30 days. With the fix: CI on this PR.
f) Fix is general, not a narrow patch? Yes — fix is inside paranoidCheckForCoveredPartsInZooKeeperOnStart itself, applying to both call sites (checkPartsImpl and createLogEntriesToFetchBrokenParts).
LLVM Coverage Report
Changed lines: 16.67% (3/18) | lost baseline coverage: 26 line(s) · Uncovered code |

The paranoid check on startup verifies that all ZK parts covered by another ZK part either exist on disk or are in the
parts_to_fetchlist. However, it missed a third legitimate case: a local active part already covers the ZK part.This happens after a merge or mutation: the covering part is active locally, the covered part was cleaned from disk by
clearOldPartsAndRemoveFromZK, but the covered part's ZK entry has not been cleaned up yet (ZK cleanup can lag behind disk cleanup, or the server was restarted between the two). The data is preserved in the local covering part, and the stale ZK entry will be removed by the cleanup thread after startup.Without this fix, the
chassert(false)fires as a false positive in debug/sanitizer builds during stress tests, specifically in thecreateLogEntriesToFetchBrokenParts()→paranoidCheckForCoveredPartsInZooKeeperOnStart()call path.CI evidence (STID: 2508-5dc3, 2508-6644):
createLogEntriesToFetchBrokenParts→paranoidCheckForCoveredPartsInZooKeeperOnStart→chassert(false)at line 1839parts_to_fetchpassed fromcreateLogEntriesToFetchBrokenPartsFix: Before asserting, check if
getActiveContainingPart(part_name)returns a non-null result. If a local active part covers the ZK part, the data is safe and the assertion should not fire. The assertion is preserved for genuine data loss scenarios (no local covering part, no disk copy, not being fetched).Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix a false-positive
Logical error: 'false'assertion inparanoidCheckForCoveredPartsInZooKeeperOnStartthat could crash the server duringReplicatedMergeTreestartup in debug/sanitizer builds. The check now correctly recognizes that a covered ZK part whose data is preserved in a local active covering part is a legitimate state, not a data loss scenario.Documentation entry for user-facing changes