Fix logical error in quorum insert recovery when part was discarded by failed-quorum cleanup#106424
Conversation
…y failed-quorum cleanup When a quorum `INSERT` commits its part to Keeper but the client gets a hardware error, `ReplicatedMergeTreeSink::commitPart` keeps the still `PreActive` part and retries to verify the status in Keeper. If it cannot verify it (eternal hardware error), `actionAfterLastFailedRetry` commits the local part and reports `UNKNOWN_STATUS_OF_INSERT`. Meanwhile another replica can mark the quorum as failed, and `ReplicatedMergeTreeRestartingThread::removeFailedQuorumParts` discards the still `PreActive` part via `forcefullyMovePartToDetachedAndRemoveFromMemory`, removing it from the working set. The subsequent `transaction.commit` then raised the `Part ... doesn't exist` logical error (server abort). Check the part state under the parts lock and only commit it if it is still `PreActive`; otherwise clear the transaction, because the part was already discarded by the failed-quorum cleanup. The decision is taken under the parts lock, so it cannot race with the cleanup. Reproduced by a new integration test that forces the recovery to be unable to verify the part status in Keeper (via the `replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault` failpoint) while the failed-quorum cleanup runs. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=3d01fe3bf67202e2bb11d8054d7988b2d0b961dc&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
72146a9 to
34128d7
Compare
Three follow-ups from review of the quorum INSERT recovery fix: - `forcefullyMovePartToDetachedAndRemoveFromMemory` moves a still `PreActive` part out of the working set without notifying `preactive_parts_cv`. A concurrent `DROP_RANGE`/`REPLACE` blocked in `waitForPreActivePartsInRange` could therefore miss the predicate becoming false and wait until some unrelated notification. The failed-quorum cleanup that this PR handles takes exactly this path, so wake the waiters at the source when a `PreActive` part is removed. - Add a `LOG_DEBUG` on the rare branch where the part was already discarded by the failed-quorum cleanup (nothing to commit), to aid post-mortems, and trim the explanatory comment. - The integration test only asserted that the `INSERT` raised some exception; assert that it is specifically `UNKNOWN_STATUS_OF_INSERT`, so a `TABLE_IS_READ_ONLY`, raw Keeper, or cancellation error can no longer satisfy the regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed all review feedback in 8db039b:
Built clean ( |
…a PreActive part When `removeFailedQuorumParts` discards a still-`PreActive` part via `forcefullyMovePartToDetachedAndRemoveFromMemory`, the part owns an uncommitted part storage transaction. On object-storage disks its writes and the rename into `detached/` are only buffered and are normally materialized by `MergeTreeData::Transaction::commit` through `commitTransaction`. Since the recovery path in `commitPart` now skips that commit (the part already left the working set), the detached part would never be materialized and its blobs would be orphaned. Commit the active part storage transaction at the moment the `PreActive` part is force-detached so the detached part is materialized regardless of what the insert thread does next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ailed-quorum-cleanup-race
`getDataPartStorage` on a const `DataPartPtr` returns a const reference, so call it through `asMutableDeletingPart` (as already done for `renameToDetached` just above) to invoke the non-const `commitTransaction`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Merged latest Addressed the review thread about the object-storage part storage transaction:
Verified locally: The only CI red was |
| /// the working set without that commit ever happening, so commit the part storage transaction now; | ||
| /// otherwise the detached part is never materialized and its blobs are orphaned. | ||
| if (was_preactive && part->getDataPartStorage().hasActiveTransaction()) | ||
| asMutableDeletingPart(part)->getDataPartStorage().commitTransaction(); |
There was a problem hiding this comment.
This closes the object-storage transaction only after the part has already been moved out of PreActive, but before the helper erases it from data_parts_indexes and wakes preactive_parts_cv. If commitTransaction throws here (for example, an object-storage metadata commit or disk error), removeFailedQuorumParts unwinds with the part in Deleting state and still indexed, and any waitForPreActivePartsInRange waiter can sleep forever because the predicate became false without a notification. The insert recovery thread will then also see part->getState() != PreActive and call transaction.clear, so the original insert transaction no longer rolls the part/storage transaction back.
Please make this transition exception-safe: once the helper changes a PreActive part to another state, it must either complete the detach bookkeeping or restore/roll back enough state, and it must notify the condition variable even if closing the part storage transaction fails. The commitPart transaction.clear branch should only be reachable after the detach path has definitely finished closing the storage transaction.
There was a problem hiding this comment.
Made the transition exception-safe in 38116c167c7. The part storage transaction is now committed before renameToDetached (where nothing has been buffered into it yet on this path), and if commitTransaction throws the part state is restored to PreActive and the exception is rethrown:
modifyPartState(it_part, DataPartState::Deleting, lock);
if (was_preactive && part->getDataPartStorage().hasActiveTransaction())
{
try
{
asMutableDeletingPart(part)->getDataPartStorage().commitTransaction();
}
catch (...)
{
modifyPartState(it_part, DataPartState::PreActive, lock);
throw;
}
}
asMutableDeletingPart(part)->renameToDetached(prefix, /*ignore_error=*/ replicated);
data_parts_indexes.erase(it_part);
if (was_preactive)
preactive_parts_cv.notify_all();So on failure the helper unwinds with the part still PreActive, still indexed, and with its storage transaction active — the recovering INSERT then sees PreActive and commits or rolls it back normally, and waitForPreActivePartsInRange waiters keep waiting on a correctly-still-PreActive part rather than sleeping forever on one stuck in Deleting. The transaction.clear branch in commitPart is therefore only reachable after the detach path has definitely closed the storage transaction.
There was a problem hiding this comment.
This is still not fully exception-safe in current 38116c167c7: after commitTransaction succeeds at MergeTreeData.cpp:5903, renameToDetached at MergeTreeData.cpp:5912 can still throw DB::Exception (for example from object-storage moveDirectory; ignore_error only suppresses ErrnoException / fs::filesystem_error). If that happens, the helper unwinds with the part left in Deleting and still indexed, and the recovering insert then observes part->getState() != PreActive at ReplicatedMergeTreeSink.cpp:1040 and calls Transaction::clear, so the part is stuck and preactive_parts_cv is not notified. Please keep the renameToDetached / erase part of the was_preactive transition exception-safe too: either complete detach, erase, and notify, or restore PreActive before rethrowing.
There was a problem hiding this comment.
You're right — renameToDetached(prefix, ignore_error=replicated) only swallows ErrnoException and fs::filesystem_error, so any other DB::Exception (e.g. from an object-storage moveDirectory) propagates and the was_preactive wake-up was skipped.
The important detail is that waitForPreActivePartsInRange's predicate counts only PreActive parts, so it already flips to false at the modifyPartState(it_part, Deleting, lock) transition above — before either commitTransaction or renameToDetached runs. So the gap was purely a lost wake-up, not a stuck predicate: the waiter would have been satisfied on re-check but was never notified.
Fixed in 7553384d92b by moving the notification into a SCOPE_EXIT registered right after the part leaves PreActive, so the wake-up is guaranteed on every exit path (success, a throwing commitTransaction, or a throwing renameToDetached):
modifyPartState(it_part, DataPartState::Deleting, lock);
SCOPE_EXIT({
if (was_preactive && part->getState() != DataPartState::PreActive)
preactive_parts_cv.notify_all();
});The guard fires only when the part actually ended up out of PreActive, so the commitTransaction failure path — which restores the part to PreActive — emits no spurious notification.
Note the alternative "restore PreActive before rethrowing" doesn't apply on the rename path: once commitTransaction has succeeded the storage transaction is gone, so the part can't be meaningfully put back to PreActive. Waking the waiters to re-check (the predicate is already satisfied) is the correct guarantee; the recovering INSERT then observes the part is no longer PreActive and calls Transaction::clear, which is correct because the storage transaction was already committed.
There was a problem hiding this comment.
The current code still leaves this transition non-exception-safe. After commitTransaction succeeds, renameToDetached at src/Storages/MergeTree/MergeTreeData.cpp:5922 can still throw; the new SCOPE_EXIT only notifies, but it neither restores the part to PreActive nor erases it from data_parts_indexes. Then commitPart observes part->getState() != PreActive at src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp:1040 and calls Transaction::clear, so the insert transaction will not roll the part back.
The stale Deleting entry is not cleaned by the part checker either: findLocalPart only returns PreActive or active containing parts, and a later fetch/empty-part attempt will run into the existing Deleting duplicate. Please keep the renameToDetached path exception-safe too: either complete detach/erase, or restore PreActive before rethrowing.
There was a problem hiding this comment.
A few points on the remaining concern:
-
The gap this PR actually introduced — the lost wake-up of
preactive_parts_cv— is now closed by theSCOPE_EXIT(it fires on therenameToDetached-throw path too, because the part is already out ofPreActive). -
The "part left in
Deletingand still indexed ifrenameToDetachedthrows" is pre-existing behavior offorcefullyMovePartToDetachedAndRemoveFromMemoryfor all callers — before this PR the body wasmodifyPartState(Deleting); renameToDetached(...); data_parts_indexes.erase(...)with no exception handling around the rename/erase. This PR does not change that path; it only adds the (exception-safe)commitTransactionand the (now exception-safe) wake-up for thewas_preactivecase. -
The suggested "restore
PreActivebefore rethrowing" is not correct on this path. The part is being discarded precisely because the quorum was marked failed; restoring it toPreActiveso the recoveringINSERTthen commits it toActivewould resurrect a failed-quorum insert and make it visible — the opposite of the intended outcome. (That restore is only sound on thecommitTransaction-failure branch, where the storage transaction is still active and the decision is genuinely handed back to theINSERT.) -
The other option — completing the erase even when the best-effort detach rename throws a non-
ErrnoException/fs::filesystem_errorDB::Exception— is reasonable in spirit (ignore_error=replicatedis meant to be best-effort), but it leaves the committed part files orphaned in their original location and so interacts with restart re-discovery and theenqueuePartForCheckpath. That is a broader hardening offorcefullyMovePartToDetachedAndRemoveFromMemoryaffecting every replicated caller, not just the quorum-recovery one, so I'd rather not fold it into this targeted bug fix without an explicit decision on the desired discard semantics.
There was a problem hiding this comment.
Current head still has the same exception path: after commitTransaction succeeds, renameToDetached at src/Storages/MergeTree/MergeTreeData.cpp:5922 can throw, and the only guaranteed cleanup is the SCOPE_EXIT notification at lines 5896-5899. That wakes waiters, but it does not restore PreActive or erase the part from data_parts_indexes, so the part can remain indexed as Deleting while commitPart clears the insert transaction at src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp:1049.
There was a problem hiding this comment.
The current commitTransaction failure branch still hands a failed-quorum part back to the insert side in a state that can be committed. removeFailedQuorumParts removes the part from Keeper before calling this helper, and the retry loop reaches actionAfterLastFailedRetry only after Keeper reads keep failing, so it cannot observe the remaining quorum/failed_parts marker. If this catch restores PreActive, commitPart can take the part->getState() == PreActive branch and commit the local part, making a failed-quorum row visible until the delayed part check detaches it. Please do not restore a failed-quorum-discarded part to a state where the insert fallback can commit it after Keeper cleanup has already happened; the local discard either needs to complete/record an uncommittable state, or the insert fallback needs a way to distinguish this failed cleanup from a still-committable unknown-status part.
| /// the working set without that commit ever happening, so commit the part storage transaction now; | ||
| /// otherwise the detached part is never materialized and its blobs are orphaned. | ||
| if (was_preactive && part->getDataPartStorage().hasActiveTransaction()) | ||
| asMutableDeletingPart(part)->getDataPartStorage().commitTransaction(); |
If `commitTransaction` threw in `forcefullyMovePartToDetachedAndRemoveFromMemory` after the part had already been renamed to detached, the part was left in the `Deleting` state and still present in `data_parts_indexes`, and `preactive_parts_cv` was never notified, so `waitForPreActivePartsInRange` waiters could sleep forever; the recovering `INSERT` would also see a non-`PreActive` state and clear the transaction without anyone rolling the part storage transaction back. Commit the part storage transaction before `renameToDetached`, while nothing has been buffered into it on this path, and restore the `PreActive` state if the commit throws. A failure now unwinds with the part still `PreActive`, indexed, and with its storage transaction active, so the recovering `INSERT` commits or rolls it back normally, and the `transaction.clear` branch in `commitPart` is only reachable after the cleanup has definitely closed the storage transaction.
|
Pushed |
The `was_preactive` notification in `forcefullyMovePartToDetachedAndRemoveFromMemory` was only sent after `renameToDetached` and `data_parts_indexes.erase` succeeded. But `renameToDetached(prefix, ignore_error=replicated)` only swallows `ErrnoException` and `fs::filesystem_error`; any other `DB::Exception` (for example from an object-storage `moveDirectory`) propagates. The `waitForPreActivePartsInRange` predicate counts only `PreActive` parts, so it already flips to false at the `modifyPartState(..., Deleting, ...)` transition above — long before the rename. If the rename then threw, the predicate was already satisfied but `preactive_parts_cv.notify_all` was skipped, so a blocked `DROP_RANGE` waiter could sleep forever on a lost wake-up. Move the notification into a `SCOPE_EXIT` registered right after the part leaves `PreActive`, guaranteeing the wake-up on every exit path (success, a throwing `commitTransaction`, or a throwing `renameToDetached`). The guard fires only when the part actually ended up out of `PreActive`, so the `commitTransaction` failure path — which restores the part to `PreActive` — does not emit a spurious notification. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ailed-quorum-cleanup-race
|
Pushed two updates:
|
| { | ||
| try | ||
| { | ||
| asMutableDeletingPart(part)->getDataPartStorage().commitTransaction(); |
There was a problem hiding this comment.
optional: this object-storage materialization path has no test.
the integration test runs on a local-disk table where a PreActive part has no active transaction, so this branch is skipped; an s3 / SharedMergeTree variant asserting the detached noquorum_* part survives the discard (re-attachable, no orphaned blobs) would cover it.
There was a problem hiding this comment.
Addressed in 0391cbe — added test_insert_quorum_with_keeper_fail_during_unknown_status_object_storage, which runs the same deterministic failpoint orchestration on a ReplicatedMergeTree table backed by an s3 disk (MinIO). It asserts the failed-quorum part is materialized as a detached noquorum part on the s3 disk via system.detached_parts. On object storage that row only appears if commitTransaction ran in the discard path; without the fix the buffered rename into detached/ would be lost together with the discarded part storage transaction and the blobs would be orphaned, so the assertion directly covers the was_preactive && hasActiveTransaction() branch. Verified locally: both the existing local-disk test and the new object-storage test pass.
…ailed-quorum-cleanup-race
|
Re-merged Verified after the merge:
CI was green (162/139/0) at the previous head |
…ailed-quorum-cleanup-race
…of hanging The added `test_insert_quorum_with_keeper_fail_during_unknown_status` submits the `INSERT` into a `ThreadPoolExecutor` and pauses it at the `replicated_merge_tree_insert_retry_pause` failpoint. If any assertion or wait failed before the failpoint was disabled, leaving the `with ThreadPoolExecutor` block would call `executor.shutdown(wait=True)`, which blocks forever on the still-paused insert, so the outer `finally` that disables the failpoint never runs and the test hangs (CI timeout) instead of failing cleanly. Wrap the executor-block body in a `try`/`finally` that disables `replicated_merge_tree_insert_retry_pause` before the executor is joined, so the paused insert is always released and a failed assertion produces a clean failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ailed-quorum-cleanup-race
…ailed-quorum-cleanup-race
…tion The existing `test_insert_quorum_with_keeper_fail_during_unknown_status` runs on a local-disk table, where a discarded `PreActive` part's storage transaction is a no-op, so it does not exercise the object-storage branch in `forcefullyMovePartToDetachedAndRemoveFromMemory` that commits the part storage transaction before detaching (the `was_preactive && hasActiveTransaction()` path). This was raised in review: #106424 (comment) Add `test_insert_quorum_with_keeper_fail_during_unknown_status_object_storage`: the same deterministic failpoint orchestration, but on a `ReplicatedMergeTree` table backed by an `s3` disk. After the recovery reports `UNKNOWN_STATUS_OF_INSERT` and no logical error, it asserts the failed-quorum part was materialized as a detached `noquorum` part on the `s3` disk (via `system.detached_parts`). On object storage that row only appears if `commitTransaction` ran in the discard path; without the fix the buffered rename into `detached/` would be lost together with the discarded part storage transaction and the written blobs would be orphaned. Adds a MinIO-backed `s3` disk and storage policy to the test module configs. Verified locally: both the local-disk and the new object-storage tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Addressed the AI Review test gap (object-storage coverage). Added On the remaining AI "Request changes" major (the post- CI: the only red jobs are the two |
|
The latest CI run (
A On the recurring AI "Request changes" major (the |
…ailed-quorum-cleanup-race
|
Re-merged Why re-merge now: On the previous CI reds: the only red jobs on The recurring AI "Request changes" major (the |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 23/25 (92.00%) · Uncovered code |

A quorum
INSERTwhose part is committed to Keeper but whose client receives a hardware error enters the recovery loop inReplicatedMergeTreeSink::commitPart. While it keeps the stillPreActivepart and tries to verify the status in Keeper, another replica can mark the quorum as failed, andReplicatedMergeTreeRestartingThread::removeFailedQuorumPartsdiscards the part viaforcefullyMovePartToDetachedAndRemoveFromMemory, removing it from the working set. If the recovery then cannot verify the status (eternal hardware error) and reachesactionAfterLastFailedRetry, the unconditionaltransaction.commitraisedPart ... doesn't exist— a logical error that aborts the server (the commit runs inside aNOEXCEPT_SCOPE, so it terminates release builds too).The fix takes the parts lock in
actionAfterLastFailedRetryand commits the transaction only if the part is stillPreActive; otherwise it clears the transaction, because the part was already discarded by the failed-quorum cleanup. Taking the decision under the parts lock makes it free of a race with the cleanup. The query still reportsUNKNOWN_STATUS_OF_INSERTas before.A new deterministic integration test reproduces the abort by forcing the recovery to be unable to verify the part status in Keeper (
replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault) while the failed-quorum cleanup runs. Verified locally: the test aborts without the fix and passes with it.Found by CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=3d01fe3bf67202e2bb11d8054d7988b2d0b961dc&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a logical error (
Part ... doesn't exist) that could abort the server during recovery of a quorumINSERTfrom a Keeper hardware error when the quorum was concurrently marked as failed.Documentation entry for user-facing changes
Version info
26.7.1.394