{{ message }}
Fix Keeper crash on stale snapshot chunk request#107141
Closed
groeneai wants to merge 1 commit into
Closed
Conversation
read_logical_snp_obj aborted with a chassert(false) when called with obj_id > 0 and a null user_snp_ctx. That state is reachable during snapshot rotation: the obj_id == 0 request can return -1 without creating a transfer context (the requested snapshot was retired so its pin is missing, or the loader failed to init). NuRaft then resets the peer's snapshot_sync_ctx but keeps its offset > 0, so the next async_io_loop iteration asks for a later chunk while the fresh context has user_snp_ctx == nullptr, and the leader aborts. This is a legitimate stale request, not an invariant violation. Decline it the same way as the missing-pin deferral (free the context, count it as KeeperReadSnapshotDeferred, return -1); NuRaft restarts the install from obj_id 0 against the latest snapshot. Marking it as a deferral rather than a failure also stops it from inflating KeeperReadSnapshotFailed. Add a unit test that drives read_logical_snp_obj with obj_id > 0 and no context and asserts it declines instead of crashing, then verifies a normal transfer still succeeds. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Author
Contributor
Author
|
cc @antonio2368 @Michicosun - could you review? This fixes a Keeper leader crash ( |
Member
Contributor
Author
This was referenced Jun 11, 2026
Open
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix a
Keeperserver crash (Logical error: 'false') that could happen during snapshot transfer to a recovering follower while snapshots were rotating. A stale snapshot chunk request is now declined instead of aborting the server.Description
IKeeperStateMachine::read_logical_snp_objaborted withchassert(false)when NuRaft called it withobj_id > 0and a nulluser_snp_ctx.That state is reachable during snapshot rotation under write load. The
obj_id == 0request can return-1without creating a transfer context (the requested snapshot was retired so its pin is missing, or the loader failed to init). NuRaft then resets the peer'ssnapshot_sync_ctxbut keeps itsoffset > 0, so the nextasync_io_loopiteration asks for a later chunk while the fresh context hasuser_snp_ctx == nullptr, and the leader aborts.This is a legitimate stale request, not an invariant violation, so the chunk is now declined the same way as the existing missing-pin deferral: free the context, count it as
KeeperReadSnapshotDeferred, and return-1. NuRaft restarts the install fromobj_id 0against the latest snapshot. Marking it as a deferral rather than a failure also stops it from inflatingKeeperReadSnapshotFailed.Introduced by #104941. Surfaced by the
test_keeper_snapshot_rotation_raceintegration test on master (Logical error: 'false'inread_logical_snp_obj, andKeeperReadSnapshotFailed delta = 1). Added a deterministic unit test (ReadSnapshotChunkWithoutContextDoesNotCrash) that drives the exact state: it aborts the server on the current code and passes with this fix.