Fix race between async outdated parts loading and paranoid check on startup by alexey-milovidov · Pull Request #96705 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix race between async outdated parts loading and paranoid check on startup#96705

Merged
alexey-milovidov merged 4 commits into
masterfrom
fix-paranoid-check-assertion
Feb 12, 2026
Merged

Fix race between async outdated parts loading and paranoid check on startup#96705
alexey-milovidov merged 4 commits into
masterfrom
fix-paranoid-check-assertion

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Feb 11, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Fix chassert(false) in paranoidCheckForCoveredPartsInZooKeeperOnStart firing during Stress test (MSan).

Root cause: paranoidCheckForCoveredPartsInZooKeeperOnStart checks for parts that exist in ZooKeeper and are covered by another ZK part but don't exist in the in-memory data_parts set (using getPartIfExists). During startup, outdated (covered) parts are loaded asynchronously by loadOutdatedDataParts. The paranoid check runs in the first (optimistic) call to checkPartsImpl, concurrently with async loading. A covered part may exist on disk but not yet be loaded into data_parts, causing getPartIfExists to return null and the chassert(false) to fire.

The checkParts function already has a retry mechanism: if checkPartsImpl returns false, it calls waitForOutdatedPartsToBeLoaded and retries. But the chassert aborts the process before the retry can happen.

Fix: Skip the paranoid check when incomplete_list_of_outdated_parts is true. The check will run on the second call to checkPartsImpl after all outdated parts have been loaded, which is consistent with how the rest of the function handles incomplete outdated parts (see the guard at line 1920).

Under MSan, everything is slower, making the async loading take much longer and significantly widening this race window.

Stack trace from CI:

Logical error: 'false'
Stack trace:
  DB::abortOnFailedAssertion(String const&)
  DB::StorageReplicatedMergeTree::paranoidCheckForCoveredPartsInZooKeeperOnStart(...)
  DB::StorageReplicatedMergeTree::checkPartsImpl(bool)
  DB::StorageReplicatedMergeTree::checkParts(bool)
  DB::ReplicatedMergeTreeAttachThread::runImpl()

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=7c0839ffc607a2585f6cf49eedd0e0a49c85c452&name_0=MasterCI&name_1=Stress%20test%20%28amd_msan%29

Version info

  • Merged into: 26.2.1.616

…nStart`

The assertion fires when a part exists in ZooKeeper and is covered by
another part in ZooKeeper, but doesn't exist on any disk. This happens
legitimately after merges or mutations: the old part's ZK entry can
outlive its on-disk data if the server is stopped before the cleanup
thread removes the stale ZK entry.

In the stress test (MSan), this caused the server to abort repeatedly
on restart, preventing the post-stress-test check phase from running.

The warning log and `ReplicatedCoveredPartsInZooKeeperOnStart` profile
event are preserved for monitoring.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=7c0839ffc607a2585f6cf49eedd0e0a49c85c452&name_0=MasterCI&name_1=Stress%20test%20%28amd_msan%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Feb 11, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Feb 11, 2026
@alexey-milovidov

alexey-milovidov commented Feb 11, 2026

Copy link
Copy Markdown
Member Author

@alexey-milovidov alexey-milovidov marked this pull request as draft February 11, 2026 21:23
alexey-milovidov and others added 2 commits February 11, 2026 22:26
When a part exists in ZooKeeper but not on any disk and is covered
by another part in ZooKeeper, proactively remove the stale ZK entry
during startup. This can happen when:

- A ZK multi-op in `checkPartChecksumsAndCommit` registered the part
  in ZK, but the server crashed before the data was flushed to disk.
- A fetch was optimized to download a covering part instead, leaving
  the original's ZK entry orphaned.

Previously, this situation was detected by
`paranoidCheckForCoveredPartsInZooKeeperOnStart` which only logged a
warning and fired `chassert(false)`. The assertion caused the server to
abort repeatedly in debug/sanitizer builds, preventing restart after
stress tests.

Now the function returns the list of stale parts, and `checkPartsImpl`
removes them from ZooKeeper before proceeding. This prevents
false-positive 'part is lost forever' messages and eliminates the crash.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=7c0839ffc607a2585f6cf49eedd0e0a49c85c452&name_0=MasterCI&name_1=Stress%20test%20%28amd_msan%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tartup

The `paranoidCheckForCoveredPartsInZooKeeperOnStart` checks for parts that
exist in ZooKeeper and are covered by another ZK part but don't exist in the
in-memory `data_parts` set. It uses `getPartIfExists` which checks the
in-memory set, not the filesystem directly.

During startup, outdated (covered) parts are loaded asynchronously by
`loadOutdatedDataParts`. The paranoid check runs in the first (optimistic)
call to `checkPartsImpl`, which happens concurrently with async loading.
A covered part may exist on disk but not yet be in `data_parts`, causing
`getPartIfExists` to return null and the `chassert(false)` to fire.

The `checkParts` function already has a retry mechanism: if `checkPartsImpl`
returns false, it calls `waitForOutdatedPartsToBeLoaded` and retries. But
the `chassert` aborts the process before the retry can happen.

The fix: skip the paranoid check when `incomplete_list_of_outdated_parts`
is true. It will run on the second call after all outdated parts are loaded.

This was observed in the Stress test (MSan) where slower execution widens
the race window between async loading and the paranoid check.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=7c0839ffc607a2585f6cf49eedd0e0a49c85c452&name_0=MasterCI&name_1=Stress%20test%20%28amd_msan%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov changed the title Remove chassert in paranoidCheckForCoveredPartsInZooKeeperOnStart Fix race between async outdated parts loading and paranoid check on startup Feb 11, 2026
@alexey-milovidov alexey-milovidov marked this pull request as ready for review February 12, 2026 01:51
@alexey-milovidov alexey-milovidov self-assigned this Feb 12, 2026
@alexey-milovidov alexey-milovidov merged commit 3c501a2 into master Feb 12, 2026
133 of 134 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-paranoid-check-assertion branch February 12, 2026 07:14
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 12, 2026
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Apr 16, 2026
…rOnStart

The paranoid check on startup verifies that all ZK parts covered by another
ZK part either exist on disk or are in the parts_to_fetch list. However, it
does not check whether a local active part already covers the ZK part.

This legitimately happens after a merge or mutation: the covering part is
active locally, the covered part was cleaned from disk by
clearOldPartsAndRemoveFromZK, but the covered part's ZK entry has not been
cleaned up yet (ZK cleanup can lag behind disk cleanup, or the server was
restarted between the two). The data is preserved in the local covering
part, and the stale ZK entry will be removed by the cleanup thread after
startup.

Without this fix, the chassert(false) fires as a false positive in
debug/sanitizer builds during stress tests, particularly in the
createLogEntriesToFetchBrokenParts() call path (STID: 2508-5dc3,
2508-6644). Prior fixes PR ClickHouse#96164, ClickHouse#96705, and ClickHouse#100652 addressed other
false-positive scenarios in this function but missed the case where a local
active part covers the ZK part.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants