Fix Keeper crash on stale snapshot chunk request by groeneai · Pull Request #107141 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix Keeper crash on stale snapshot chunk request#107141

Closed
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:fix-keeper-snapshot-null-ctx-crash
Closed

Fix Keeper crash on stale snapshot chunk request#107141
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:fix-keeper-snapshot-null-ctx-crash

Conversation

@groeneai

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a Keeper server crash (Logical error: 'false') that could happen during snapshot transfer to a recovering follower while snapshots were rotating. A stale snapshot chunk request is now declined instead of aborting the server.

Description

IKeeperStateMachine::read_logical_snp_obj aborted with chassert(false) when NuRaft called it with obj_id > 0 and a null user_snp_ctx.

That state is reachable during snapshot rotation under write load. The obj_id == 0 request can return -1 without creating a transfer context (the requested snapshot was retired so its pin is missing, or the loader failed to init). NuRaft then resets the peer's snapshot_sync_ctx but keeps its offset > 0, so the next async_io_loop iteration asks for a later chunk while the fresh context has user_snp_ctx == nullptr, and the leader aborts.

This is a legitimate stale request, not an invariant violation, so the chunk is now declined the same way as the existing missing-pin deferral: free the context, count it as KeeperReadSnapshotDeferred, and return -1. NuRaft restarts the install from obj_id 0 against the latest snapshot. Marking it as a deferral rather than a failure also stops it from inflating KeeperReadSnapshotFailed.

Introduced by #104941. Surfaced by the test_keeper_snapshot_rotation_race integration test on master (Logical error: 'false' in read_logical_snp_obj, and KeeperReadSnapshotFailed delta = 1). Added a deterministic unit test (ReadSnapshotChunkWithoutContextDoesNotCrash) that drives the exact state: it aborts the server on the current code and passes with this fix.

read_logical_snp_obj aborted with a chassert(false) when called with
obj_id > 0 and a null user_snp_ctx. That state is reachable during
snapshot rotation: the obj_id == 0 request can return -1 without
creating a transfer context (the requested snapshot was retired so its
pin is missing, or the loader failed to init). NuRaft then resets the
peer's snapshot_sync_ctx but keeps its offset > 0, so the next
async_io_loop iteration asks for a later chunk while the fresh context
has user_snp_ctx == nullptr, and the leader aborts.

This is a legitimate stale request, not an invariant violation. Decline
it the same way as the missing-pin deferral (free the context, count it
as KeeperReadSnapshotDeferred, return -1); NuRaft restarts the install
from obj_id 0 against the latest snapshot. Marking it as a deferral
rather than a failure also stops it from inflating
KeeperReadSnapshotFailed.

Add a unit test that drives read_logical_snp_obj with obj_id > 0 and no
context and asserts it declines instead of crashing, then verifies a
normal transfer still succeeds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @antonio2368 @Michicosun - could you review? This fixes a Keeper leader crash (chassert(false) in read_logical_snp_obj) when NuRaft requests a snapshot chunk with obj_id > 0 after the peer's sync context was reset during rotation. The stale chunk is now declined as a deferral instead of aborting. Introduced by #104941; surfaced by test_keeper_snapshot_rotation_race on master.

@antonio2368

Copy link
Copy Markdown
Member

@groeneai check if the last failures were actually fixed by #106842

@groeneai

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants