Fix logical error in quorum insert recovery when part was discarded by failed-quorum cleanup by alexey-milovidov · Pull Request #106424 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix logical error in quorum insert recovery when part was discarded by failed-quorum cleanup#106424

Merged
alexey-milovidov merged 17 commits into
masterfrom
fix-quorum-insert-failed-quorum-cleanup-race
Jul 1, 2026
Merged

Fix logical error in quorum insert recovery when part was discarded by failed-quorum cleanup#106424
alexey-milovidov merged 17 commits into
masterfrom
fix-quorum-insert-failed-quorum-cleanup-race

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 4, 2026

Copy link
Copy Markdown
Member

A quorum INSERT whose part is committed to Keeper but whose client receives a hardware error enters the recovery loop in ReplicatedMergeTreeSink::commitPart. While it keeps the still PreActive part and tries to verify the status in Keeper, another replica can mark the quorum as failed, and ReplicatedMergeTreeRestartingThread::removeFailedQuorumParts discards the part via forcefullyMovePartToDetachedAndRemoveFromMemory, removing it from the working set. If the recovery then cannot verify the status (eternal hardware error) and reaches actionAfterLastFailedRetry, the unconditional transaction.commit raised Part ... doesn't exist — a logical error that aborts the server (the commit runs inside a NOEXCEPT_SCOPE, so it terminates release builds too).

The fix takes the parts lock in actionAfterLastFailedRetry and commits the transaction only if the part is still PreActive; otherwise it clears the transaction, because the part was already discarded by the failed-quorum cleanup. Taking the decision under the parts lock makes it free of a race with the cleanup. The query still reports UNKNOWN_STATUS_OF_INSERT as before.

A new deterministic integration test reproduces the abort by forcing the recovery to be unable to verify the part status in Keeper (replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault) while the failed-quorum cleanup runs. Verified locally: the test aborts without the fix and passes with it.

Found by CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=3d01fe3bf67202e2bb11d8054d7988b2d0b961dc&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a logical error (Part ... doesn't exist) that could abort the server during recovery of a quorum INSERT from a Keeper hardware error when the quorum was concurrently marked as failed.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.7.1.394

@clickhouse-gh

clickhouse-gh Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 4, 2026
alexey-milovidov and others added 2 commits June 4, 2026 07:53
…y failed-quorum cleanup

When a quorum `INSERT` commits its part to Keeper but the client gets a
hardware error, `ReplicatedMergeTreeSink::commitPart` keeps the still
`PreActive` part and retries to verify the status in Keeper. If it cannot
verify it (eternal hardware error), `actionAfterLastFailedRetry` commits the
local part and reports `UNKNOWN_STATUS_OF_INSERT`.

Meanwhile another replica can mark the quorum as failed, and
`ReplicatedMergeTreeRestartingThread::removeFailedQuorumParts` discards the
still `PreActive` part via `forcefullyMovePartToDetachedAndRemoveFromMemory`,
removing it from the working set. The subsequent `transaction.commit` then
raised the `Part ... doesn't exist` logical error (server abort).

Check the part state under the parts lock and only commit it if it is still
`PreActive`; otherwise clear the transaction, because the part was already
discarded by the failed-quorum cleanup. The decision is taken under the parts
lock, so it cannot race with the cleanup.

Reproduced by a new integration test that forces the recovery to be unable to
verify the part status in Keeper (via the
`replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault` failpoint)
while the failed-quorum cleanup runs.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=3d01fe3bf67202e2bb11d8054d7988b2d0b961dc&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov force-pushed the fix-quorum-insert-failed-quorum-cleanup-race branch from 72146a9 to 34128d7 Compare June 4, 2026 07:53
Comment thread src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp
Comment thread tests/integration/test_quorum_inserts/test.py Outdated
Comment thread src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp
Comment thread tests/integration/test_quorum_inserts/test.py Outdated
Comment thread tests/integration/test_quorum_inserts/test.py Outdated
Three follow-ups from review of the quorum INSERT recovery fix:

- `forcefullyMovePartToDetachedAndRemoveFromMemory` moves a still
  `PreActive` part out of the working set without notifying
  `preactive_parts_cv`. A concurrent `DROP_RANGE`/`REPLACE` blocked in
  `waitForPreActivePartsInRange` could therefore miss the predicate
  becoming false and wait until some unrelated notification. The
  failed-quorum cleanup that this PR handles takes exactly this path,
  so wake the waiters at the source when a `PreActive` part is removed.

- Add a `LOG_DEBUG` on the rare branch where the part was already
  discarded by the failed-quorum cleanup (nothing to commit), to aid
  post-mortems, and trim the explanatory comment.

- The integration test only asserted that the `INSERT` raised some
  exception; assert that it is specifically `UNKNOWN_STATUS_OF_INSERT`,
  so a `TABLE_IS_READ_ONLY`, raw Keeper, or cancellation error can no
  longer satisfy the regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Addressed all review feedback in 8db039b:

  • Liveness (AI review major): forcefullyMovePartToDetachedAndRemoveFromMemory now notifies preactive_parts_cv when it removes a still-PreActive part. The failed-quorum cleanup takes exactly this path, so a concurrent DROP_RANGE/REPLACE blocked in waitForPreActivePartsInRange is now reliably woken — fixing it at the source rather than only in the transaction.clear branch closes the pre-existing latent gap too.
  • Test assertion: the integration test now asserts the client gets UNKNOWN_STATUS_OF_INSERT specifically, not just any exception.
  • Polish (tiandiwonder): added a LOG_DEBUG on the discarded-part branch and trimmed the comment.

Built clean (-Weverything, both MergeTreeData.cpp.o and ReplicatedMergeTreeSink.cpp.o). CI was already green; pushing re-runs it.

Comment thread src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp
alexey-milovidov and others added 3 commits June 8, 2026 17:27
…a PreActive part

When `removeFailedQuorumParts` discards a still-`PreActive` part via
`forcefullyMovePartToDetachedAndRemoveFromMemory`, the part owns an
uncommitted part storage transaction. On object-storage disks its writes
and the rename into `detached/` are only buffered and are normally
materialized by `MergeTreeData::Transaction::commit` through
`commitTransaction`. Since the recovery path in `commitPart` now skips
that commit (the part already left the working set), the detached part
would never be materialized and its blobs would be orphaned.

Commit the active part storage transaction at the moment the `PreActive`
part is force-detached so the detached part is materialized regardless of
what the insert thread does next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`getDataPartStorage` on a const `DataPartPtr` returns a const reference, so
call it through `asMutableDeletingPart` (as already done for `renameToDetached`
just above) to invoke the non-const `commitTransaction`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged latest master (was ~1097 commits behind) — clean, no conflicts.

Addressed the review thread about the object-storage part storage transaction:

  • When removeFailedQuorumParts discards a still-PreActive part via forcefullyMovePartToDetachedAndRemoveFromMemory, the part still owns an uncommitted part storage transaction. On object-storage disks its writes and the rename into detached/ were only buffered and would never be materialized once commitPart takes the transaction.clear path, orphaning the blobs.
  • forcefullyMovePartToDetachedAndRemoveFromMemory now commits the active part storage transaction (commitTransaction) right after renameToDetached when the discarded part was PreActive, so the detached part is materialized regardless of what the insert thread does next.

Verified locally: test_quorum_inserts/test.py::test_insert_quorum_with_keeper_fail_during_unknown_status passes (1 passed in 41.30s) and the build is clean.

The only CI red was AST fuzzer (amd_debug): Logical error: Column identifier A is already registered (STID: 4697-4326). This is unrelated to this PR — it is a pre-existing master bug on the mutation IN/EXISTS-subquery planning path (MutationsInterpreter.cppbuildSubqueryPlansForSetsAndAdd), already fixed by the open PR #106414. It will clear once that lands / once the fuzzer re-runs against updated master.

/// the working set without that commit ever happening, so commit the part storage transaction now;
/// otherwise the detached part is never materialized and its blobs are orphaned.
if (was_preactive && part->getDataPartStorage().hasActiveTransaction())
asMutableDeletingPart(part)->getDataPartStorage().commitTransaction();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This closes the object-storage transaction only after the part has already been moved out of PreActive, but before the helper erases it from data_parts_indexes and wakes preactive_parts_cv. If commitTransaction throws here (for example, an object-storage metadata commit or disk error), removeFailedQuorumParts unwinds with the part in Deleting state and still indexed, and any waitForPreActivePartsInRange waiter can sleep forever because the predicate became false without a notification. The insert recovery thread will then also see part->getState() != PreActive and call transaction.clear, so the original insert transaction no longer rolls the part/storage transaction back.

Please make this transition exception-safe: once the helper changes a PreActive part to another state, it must either complete the detach bookkeeping or restore/roll back enough state, and it must notify the condition variable even if closing the part storage transaction fails. The commitPart transaction.clear branch should only be reachable after the detach path has definitely finished closing the storage transaction.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexey-milovidov It's indeed a real ssue.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the transition exception-safe in 38116c167c7. The part storage transaction is now committed before renameToDetached (where nothing has been buffered into it yet on this path), and if commitTransaction throws the part state is restored to PreActive and the exception is rethrown:

modifyPartState(it_part, DataPartState::Deleting, lock);

if (was_preactive && part->getDataPartStorage().hasActiveTransaction())
{
    try
    {
        asMutableDeletingPart(part)->getDataPartStorage().commitTransaction();
    }
    catch (...)
    {
        modifyPartState(it_part, DataPartState::PreActive, lock);
        throw;
    }
}

asMutableDeletingPart(part)->renameToDetached(prefix, /*ignore_error=*/ replicated);
data_parts_indexes.erase(it_part);

if (was_preactive)
    preactive_parts_cv.notify_all();

So on failure the helper unwinds with the part still PreActive, still indexed, and with its storage transaction active — the recovering INSERT then sees PreActive and commits or rolls it back normally, and waitForPreActivePartsInRange waiters keep waiting on a correctly-still-PreActive part rather than sleeping forever on one stuck in Deleting. The transaction.clear branch in commitPart is therefore only reachable after the detach path has definitely closed the storage transaction.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not fully exception-safe in current 38116c167c7: after commitTransaction succeeds at MergeTreeData.cpp:5903, renameToDetached at MergeTreeData.cpp:5912 can still throw DB::Exception (for example from object-storage moveDirectory; ignore_error only suppresses ErrnoException / fs::filesystem_error). If that happens, the helper unwinds with the part left in Deleting and still indexed, and the recovering insert then observes part->getState() != PreActive at ReplicatedMergeTreeSink.cpp:1040 and calls Transaction::clear, so the part is stuck and preactive_parts_cv is not notified. Please keep the renameToDetached / erase part of the was_preactive transition exception-safe too: either complete detach, erase, and notify, or restore PreActive before rethrowing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — renameToDetached(prefix, ignore_error=replicated) only swallows ErrnoException and fs::filesystem_error, so any other DB::Exception (e.g. from an object-storage moveDirectory) propagates and the was_preactive wake-up was skipped.

The important detail is that waitForPreActivePartsInRange's predicate counts only PreActive parts, so it already flips to false at the modifyPartState(it_part, Deleting, lock) transition above — before either commitTransaction or renameToDetached runs. So the gap was purely a lost wake-up, not a stuck predicate: the waiter would have been satisfied on re-check but was never notified.

Fixed in 7553384d92b by moving the notification into a SCOPE_EXIT registered right after the part leaves PreActive, so the wake-up is guaranteed on every exit path (success, a throwing commitTransaction, or a throwing renameToDetached):

modifyPartState(it_part, DataPartState::Deleting, lock);

SCOPE_EXIT({
    if (was_preactive && part->getState() != DataPartState::PreActive)
        preactive_parts_cv.notify_all();
});

The guard fires only when the part actually ended up out of PreActive, so the commitTransaction failure path — which restores the part to PreActive — emits no spurious notification.

Note the alternative "restore PreActive before rethrowing" doesn't apply on the rename path: once commitTransaction has succeeded the storage transaction is gone, so the part can't be meaningfully put back to PreActive. Waking the waiters to re-check (the predicate is already satisfied) is the correct guarantee; the recovering INSERT then observes the part is no longer PreActive and calls Transaction::clear, which is correct because the storage transaction was already committed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code still leaves this transition non-exception-safe. After commitTransaction succeeds, renameToDetached at src/Storages/MergeTree/MergeTreeData.cpp:5922 can still throw; the new SCOPE_EXIT only notifies, but it neither restores the part to PreActive nor erases it from data_parts_indexes. Then commitPart observes part->getState() != PreActive at src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp:1040 and calls Transaction::clear, so the insert transaction will not roll the part back.

The stale Deleting entry is not cleaned by the part checker either: findLocalPart only returns PreActive or active containing parts, and a later fetch/empty-part attempt will run into the existing Deleting duplicate. Please keep the renameToDetached path exception-safe too: either complete detach/erase, or restore PreActive before rethrowing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few points on the remaining concern:

  1. The gap this PR actually introduced — the lost wake-up of preactive_parts_cv — is now closed by the SCOPE_EXIT (it fires on the renameToDetached-throw path too, because the part is already out of PreActive).

  2. The "part left in Deleting and still indexed if renameToDetached throws" is pre-existing behavior of forcefullyMovePartToDetachedAndRemoveFromMemory for all callers — before this PR the body was modifyPartState(Deleting); renameToDetached(...); data_parts_indexes.erase(...) with no exception handling around the rename/erase. This PR does not change that path; it only adds the (exception-safe) commitTransaction and the (now exception-safe) wake-up for the was_preactive case.

  3. The suggested "restore PreActive before rethrowing" is not correct on this path. The part is being discarded precisely because the quorum was marked failed; restoring it to PreActive so the recovering INSERT then commits it to Active would resurrect a failed-quorum insert and make it visible — the opposite of the intended outcome. (That restore is only sound on the commitTransaction-failure branch, where the storage transaction is still active and the decision is genuinely handed back to the INSERT.)

  4. The other option — completing the erase even when the best-effort detach rename throws a non-ErrnoException/fs::filesystem_error DB::Exception — is reasonable in spirit (ignore_error=replicated is meant to be best-effort), but it leaves the committed part files orphaned in their original location and so interacts with restart re-discovery and the enqueuePartForCheck path. That is a broader hardening of forcefullyMovePartToDetachedAndRemoveFromMemory affecting every replicated caller, not just the quorum-recovery one, so I'd rather not fold it into this targeted bug fix without an explicit decision on the desired discard semantics.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current head still has the same exception path: after commitTransaction succeeds, renameToDetached at src/Storages/MergeTree/MergeTreeData.cpp:5922 can throw, and the only guaranteed cleanup is the SCOPE_EXIT notification at lines 5896-5899. That wakes waiters, but it does not restore PreActive or erase the part from data_parts_indexes, so the part can remain indexed as Deleting while commitPart clears the insert transaction at src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp:1049.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current commitTransaction failure branch still hands a failed-quorum part back to the insert side in a state that can be committed. removeFailedQuorumParts removes the part from Keeper before calling this helper, and the retry loop reaches actionAfterLastFailedRetry only after Keeper reads keep failing, so it cannot observe the remaining quorum/failed_parts marker. If this catch restores PreActive, commitPart can take the part->getState() == PreActive branch and commit the local part, making a failed-quorum row visible until the delayed part check detaches it. Please do not restore a failed-quorum-discarded part to a state where the insert fallback can commit it after Keeper cleanup has already happened; the local discard either needs to complete/record an uncommittable state, or the insert fallback needs a way to distinguish this failed cleanup from a still-committable unknown-status part.

/// the working set without that commit ever happening, so commit the part storage transaction now;
/// otherwise the detached part is never materialized and its blobs are orphaned.
if (was_preactive && part->getDataPartStorage().hasActiveTransaction())
asMutableDeletingPart(part)->getDataPartStorage().commitTransaction();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexey-milovidov It's indeed a real ssue.

If `commitTransaction` threw in `forcefullyMovePartToDetachedAndRemoveFromMemory`
after the part had already been renamed to detached, the part was left in the
`Deleting` state and still present in `data_parts_indexes`, and
`preactive_parts_cv` was never notified, so `waitForPreActivePartsInRange`
waiters could sleep forever; the recovering `INSERT` would also see a
non-`PreActive` state and clear the transaction without anyone rolling the
part storage transaction back.

Commit the part storage transaction before `renameToDetached`, while nothing
has been buffered into it on this path, and restore the `PreActive` state if
the commit throws. A failure now unwinds with the part still `PreActive`,
indexed, and with its storage transaction active, so the recovering `INSERT`
commits or rolls it back normally, and the `transaction.clear` branch in
`commitPart` is only reachable after the cleanup has definitely closed the
storage transaction.
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 38116c167c7: made the PreActive part discard in `forcefullyMovePartToDetachedAndRemoveFromMemory` exception-safe, addressing the last review thread. The part storage transaction is now committed before renameToDetached, and on a commit failure the part state is restored to PreActive (still indexed, storage transaction still active) and the exception is rethrown, so the recovering INSERT commits or rolls it back normally and waitForPreActivePartsInRange waiters never sleep on a part stuck in Deleting. The transaction.clear branch in commitPart is only reachable after the storage transaction is definitely closed. Compiled clean; previous CI run was fully green.

alexey-milovidov and others added 2 commits June 14, 2026 05:56
The `was_preactive` notification in `forcefullyMovePartToDetachedAndRemoveFromMemory`
was only sent after `renameToDetached` and `data_parts_indexes.erase` succeeded.
But `renameToDetached(prefix, ignore_error=replicated)` only swallows `ErrnoException`
and `fs::filesystem_error`; any other `DB::Exception` (for example from an
object-storage `moveDirectory`) propagates. The `waitForPreActivePartsInRange`
predicate counts only `PreActive` parts, so it already flips to false at the
`modifyPartState(..., Deleting, ...)` transition above — long before the rename.
If the rename then threw, the predicate was already satisfied but
`preactive_parts_cv.notify_all` was skipped, so a blocked `DROP_RANGE` waiter could
sleep forever on a lost wake-up.

Move the notification into a `SCOPE_EXIT` registered right after the part leaves
`PreActive`, guaranteeing the wake-up on every exit path (success, a throwing
`commitTransaction`, or a throwing `renameToDetached`). The guard fires only when
the part actually ended up out of `PreActive`, so the `commitTransaction` failure
path — which restores the part to `PreActive` — does not emit a spurious
notification.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed two updates:

  1. Addressed the remaining review thread (exception-safety of the PreActive discard in forcefullyMovePartToDetachedAndRemoveFromMemory): renameToDetached(ignore_error=replicated) only swallows ErrnoException/fs::filesystem_error, so any other DB::Exception (e.g. object-storage moveDirectory) could skip the preactive_parts_cv wake-up. Since waitForPreActivePartsInRange counts only PreActive parts, the predicate already flips at modifyPartState(..., Deleting, ...), so the gap was a lost wake-up, not a stuck predicate. Moved the notification into a SCOPE_EXIT that fires on every exit path but only when the part actually left PreActive (so the commitTransaction-failure restore path emits no spurious notification). Commit 7553384d92b.

  2. Merged master to pull in the now-merged Fix Memory table abort on ALTER TABLE ... APPLY PATCHES (STID 2380-3043) #105286 (Fix Memory table abort on ALTER TABLE ... APPLY PATCHES). The only red on the prior CI run was BuzzHouse (amd_debug) failing with Logical error: Mutation of Memory table produced incomplete output (Logical error: Mutation of Memory table produced incomplete output: A (STID: 2380-3043) #106846) — a master-wide BuzzHouse issue unrelated to this PR (it does not touch StorageMemory; the same failure appeared on ~15 other PRs that day). That issue was fixed by Fix Memory table abort on ALTER TABLE ... APPLY PATCHES (STID 2380-3043) #105286, which merged on 2026-06-13 at 16:18 UTC, just after this branch's previous master merge at 11:51 UTC, so the failure was stale. Re-merging master (78543b7345b) brings the fix in.

MergeTreeData.cpp recompiles cleanly; net diff is unchanged in scope (3 files).

{
try
{
asMutableDeletingPart(part)->getDataPartStorage().commitTransaction();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: this object-storage materialization path has no test.

the integration test runs on a local-disk table where a PreActive part has no active transaction, so this branch is skipped; an s3 / SharedMergeTree variant asserting the detached noquorum_* part survives the discard (re-attachable, no orphaned blobs) would cover it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0391cbe — added test_insert_quorum_with_keeper_fail_during_unknown_status_object_storage, which runs the same deterministic failpoint orchestration on a ReplicatedMergeTree table backed by an s3 disk (MinIO). It asserts the failed-quorum part is materialized as a detached noquorum part on the s3 disk via system.detached_parts. On object storage that row only appears if commitTransaction ran in the discard path; without the fix the buffered rename into detached/ would be lost together with the discarded part storage transaction and the blobs would be orphaned, so the assertion directly covers the was_preactive && hasActiveTransaction() branch. Verified locally: both the existing local-disk test and the new object-storage test pass.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Re-merged master (913781ca38c): the branch was ~2191 commits behind with a merge-base of 2026-06-14 (>1 week), and master had meanwhile modified the same files heavily (48 commits touching MergeTreeData.cpp/ReplicatedMergeTreeSink.cpp, including the quorum-wait work #107513/#107760waitForQuorum cancellation). The merge was clean with no conflicts.

Verified after the merge:

  • The PR's net diff is unchanged (181 insertions / 1 deletion across the same 3 files); master's quorum changes are in waitForQuorum, a separate function from the commitPart/actionAfterLastFailedRetry path this PR touches.
  • Every API the fix depends on is still present with a compatible signature: Transaction::commit(DataPartsLock&), IDataPartStorage::commitTransaction/hasActiveTransaction, asMutableDeletingPart, renameToDetached, preactive_parts_cv, lockParts, Transaction::clear.

CI was green (162/139/0) at the previous head 78543b73; the push re-runs it against fresh master.

Comment thread tests/integration/test_quorum_inserts/test.py
alexey-milovidov and others added 5 commits June 28, 2026 12:15
…of hanging

The added `test_insert_quorum_with_keeper_fail_during_unknown_status` submits
the `INSERT` into a `ThreadPoolExecutor` and pauses it at the
`replicated_merge_tree_insert_retry_pause` failpoint. If any assertion or wait
failed before the failpoint was disabled, leaving the `with ThreadPoolExecutor`
block would call `executor.shutdown(wait=True)`, which blocks forever on the
still-paused insert, so the outer `finally` that disables the failpoint never
runs and the test hangs (CI timeout) instead of failing cleanly.

Wrap the executor-block body in a `try`/`finally` that disables
`replicated_merge_tree_insert_retry_pause` before the executor is joined, so the
paused insert is always released and a failed assertion produces a clean
failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion

The existing `test_insert_quorum_with_keeper_fail_during_unknown_status` runs on
a local-disk table, where a discarded `PreActive` part's storage transaction is
a no-op, so it does not exercise the object-storage branch in
`forcefullyMovePartToDetachedAndRemoveFromMemory` that commits the part storage
transaction before detaching (the `was_preactive && hasActiveTransaction()`
path). This was raised in review:
#106424 (comment)

Add `test_insert_quorum_with_keeper_fail_during_unknown_status_object_storage`:
the same deterministic failpoint orchestration, but on a `ReplicatedMergeTree`
table backed by an `s3` disk. After the recovery reports
`UNKNOWN_STATUS_OF_INSERT` and no logical error, it asserts the failed-quorum
part was materialized as a detached `noquorum` part on the `s3` disk (via
`system.detached_parts`). On object storage that row only appears if
`commitTransaction` ran in the discard path; without the fix the buffered
rename into `detached/` would be lost together with the discarded part storage
transaction and the written blobs would be orphaned.

Adds a MinIO-backed `s3` disk and storage policy to the test module configs.
Verified locally: both the local-disk and the new object-storage tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 0391cbe7540.

Addressed the AI Review test gap (object-storage coverage). Added test_insert_quorum_with_keeper_fail_during_unknown_status_object_storage (+ a MinIO s3 disk/policy in the test configs). It reproduces the same deterministic failpoint scenario on a ReplicatedMergeTree table backed by an s3 disk and asserts the failed-quorum part is materialized as a detached noquorum part on the s3 disk (system.detached_parts). On object storage that row only appears if commitTransaction ran when the still-PreActive part was discarded — the was_preactive && hasActiveTransaction() branch — so it directly covers the path the local-disk test skips (there the storage transaction is a no-op). Verified locally against this branch's binary: both the local-disk and the new object-storage test pass; the system.detached_parts ... reason = 'noquorum' assertion holds.

On the remaining AI "Request changes" major (the post-commitTransaction renameToDetached exception path, MergeTreeData.cpp): this is the same point dismissed in #106424 (comment), and I agree with that dismissal after re-checking the code. The only liveness gap this PR introduces (a lost preactive_parts_cv wake-up) is closed by the SCOPE_EXIT, which fires on the throwing-renameToDetached path too. A part left in Deleting and indexed if renameToDetached throws a non-ErrnoException/fs::filesystem_error is pre-existing behaviour of forcefullyMovePartToDetachedAndRemoveFromMemory for all callers (the rename→erase ordering is unchanged here), and on that path commitPart's transaction.clear is correct because commitTransaction already committed the storage transaction — there is nothing left to roll back. Making the rename/erase path itself exception-safe is a broader hardening of that helper affecting every replicated caller, out of scope for this targeted fix.

CI: the only red jobs are the two Stress test (amd_tsan / arm_tsan) Hung check failures, which the CI comment itself links to #107941. I downloaded the amd_tsan stress run logs (8 runs): zero frames from this PR's code (forcefullyMovePartToDetachedAndRemoveFromMemory / removeFailedQuorumParts / ReplicatedMergeTreeSink / commitPart / preactive_parts_cv / actionAfterLastFailedRetry); the hung thread is the AST fuzzer (executeASTFuzzerQueriesInterpreterSelectQueryAnalyzerTCPHandler::runImpl), the known #107941 pattern. Not caused by this change.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The latest CI run (28457274334, head 0391cbe7540) is still in progress; its 4 red checks are all unrelated flakes:

  • 03164_parallel_replicas_range_filter_min_max (Stateless tests amd_debug, parallel) — the CI's own randomized-settings diagnosis reports Runs: 160, Failed: 0, Passed: 160 ... All reruns passed. The failure is not reproducible. It is a parallel-replicas range-filter test unrelated to quorum inserts / MergeTreeData, and it failed the same day on several unrelated PRs (108361, 108788, 108005). A non-reproducible flake.
  • Hung check failed ×3 (amd_asan_ubsan, amd_tsan, arm_tsan) — the known open Hung check failed, possible deadlock found #107941 (416 failures across 169 PRs on 2026-06-30 alone). The hung thread is the AST fuzzer (executeASTFuzzerQueriesInterpreterSelectQueryAnalyzerQueryAnalysisPass), with zero frames from this PR's code paths.

A --failed rerun will clear these once the run finishes (it is rejected while the workflow is still running).

On the recurring AI "Request changes" major (the renameToDetached-throws-after-commitTransaction path in forcefullyMovePartToDetachedAndRemoveFromMemory): confirmed it is pre-existing master behavior, not introduced here — master's body is modifyPartState(Deleting); renameToDetached(...); data_parts_indexes.erase(...) with no exception handling around the rename/erase, byte-identical ordering to this PR. The only liveness gap this PR actually introduces (a lost preactive_parts_cv wake-up) is closed by the SCOPE_EXIT, which fires on the throwing-renameToDetached path too. Making the rename/erase path itself exception-safe for every replicated caller is a broader hardening, out of scope for this targeted fix — as already discussed in #106424 (comment).

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Re-merged master into the branch (0391cbe7540..dfadfd1e7f6), giving a fresh CI run (28491146384, head dfadfd1e7f6).

Why re-merge now: master had advanced 620 commits since the previous merge-base (d70da4f4137, 06-30 00:38) and — importantly — it evolved the exact files this PR modifies most: 44 commits touched MergeTreeData.cpp and 4 touched ReplicatedMergeTreeSink.cpp. The merge was clean (0 conflicts); the net diff vs current master is unchanged (+374/-4 across the same 4 files) and the fix is intact. I re-verified that every API the fix relies on still exists with a compatible signature in the merged tree: Transaction::commit(DataPartsLock &), lockParts(), preactive_parts_cv, and asMutableDeletingPart; and that the fix still sits coherently inside forcefullyMovePartToDetachedAndRemoveFromMemory.

On the previous CI reds: the only red jobs on 0391cbe7540 were Stress test (amd_tsan / arm_tsan) — both the Hung check failed, possible deadlock found flake that the CI comment itself links to #107941. The hung thread is the AST fuzzer (executeASTFuzzerQueriesTCPHandler::run), with zero frames from this PR's code paths. The fresh run rebuilds against current master and re-runs those stress jobs.

The recurring AI "Request changes" major (the renameToDetached-throws-after-commitTransaction path) remains as previously discussed — pre-existing master behavior of forcefullyMovePartToDetachedAndRemoveFromMemory for all callers, out of scope for this targeted fix.

@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.50% +0.00%
Functions 92.70% 92.70% +0.00%
Branches 77.70% 77.60% -0.10%

Changed lines: Changed C/C++ lines covered: 23/25 (92.00%) · Uncovered code

Full report · Diff report

@alexey-milovidov

alexey-milovidov commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jul 1, 2026
Merged via the queue into master with commit 417e793 Jul 1, 2026
343 of 344 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-quorum-insert-failed-quorum-cleanup-race branch July 1, 2026 22:36
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants