Fix flaky test_keeper_disks::test_snapshots_with_disks#108513
Conversation
The test asserted len(local_snapshot_files) == 1 immediately after the
"Created persistent snapshot {idx}" log line and intermittently observed 2
(assert 2 == 1).
When a separate latest_snapshot_storage_disk is configured, creating a
snapshot relocates the previous (non-latest) snapshot off the local "latest"
disk to the main disk. In KeeperStateMachine::create_snapshot that log line is
emitted in the under-lock publish step (publishWrittenSnapshot), while the move
runs afterwards in the deferred, outside-the-lock maintenance step
(runSnapshotMaintenance -> moveSnapshotCandidate). selectSnapshotsToMove also
defers a move while the old snapshot is transiently pinned (use_count > 1).
Checking the count once therefore races the still-pending move.
The move is eventually consistent, so poll for the local count to settle to 1
(up to 60s) instead of asserting once, mirroring the assert_single_local_log
helper already used by test_logs_with_disks. Test-only change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @Michicosun — could you review this? Test-only fix: |
|
Workflow [PR], commit [8941881] Summary: ✅ AI ReviewSummaryThis PR fixes Final VerdictStatus: ✅ Approve |
…ts/logs The set-equality check in test_snapshots_with_disks (and the count checks) listed local and S3 files without filtering out the empty "tmp_<name>" markers that Keeper writes on the target disk while moving a snapshot/log between disks (moveFileBetweenDisks in KeeperCommon.cpp). When a listing was taken while a move was in flight, the transient marker was captured into previous_snapshot_files; after the next restart the real file remained, so set(local) != set(previous) failed with a lone tmp_-prefixed entry on the expected side. The server itself never treats tmp_-prefixed entries as snapshots/logs (it skips them when scanning disks), so the test should not either. Exclude them from get_local_files and list_s3_objects; cleanup still wipes them via exclude_tmp=False. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Follow-up commit (8941881) extends this PR to also fix the set-equality variant of the same flaky test, which is now the dominant failure mode and also hits master. Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-1:20260627-061200 |
LLVM Coverage Report
Changed lines: No C/C++ source files changed — skipping uncovered code analysis. Newly covered by added/modified tests: 730 line(s), 91 function(s) across 144 file(s) · Details Top files
|
CI finish ledger — 8941881CI fully finished on this head (Finish Workflow + Mergeable Check + Config Workflow SUCCESS; no checks queued/in_progress). Every failure has an owner. No test FAIL/ERROR rows in CIDB on this commit. The follow-up commit 8941881 (tmp_-marker exclusion in the test listings, on top of the polling fix) ran clean across the full PR matrix incl. the previously-flaky Session id: cron:our-pr-ci-monitor:20260627-100000 |
Pick up merged PR ClickHouse#108513 (fix flaky test_keeper_disks::test_snapshots_with_disks) per @alexey-milovidov request on PR ClickHouse#108568.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
Fixes flaky
test_keeper_disks/test.py::test_snapshots_with_disks. Two distinct races, same area (cross-disk snapshot moves); both are test-only issues, no server behavior is changed.1) Local snapshot count race (
assert len(local_snapshot_files) == 1->assert 2 == 1).When a separate
latest_snapshot_storage_diskis configured, creating a snapshot relocates the previous (non-latest) snapshot off the local "latest" disk. InKeeperStateMachine::create_snapshottheCreated persistent snapshot {idx}log line is emitted in the under-lock publish step (publishWrittenSnapshot), while the actual move runs afterwards in the deferred, outside-the-lock maintenance step (runSnapshotMaintenance->moveSnapshotCandidate). The test waited on that log line and checked the local count once, racing the still-pending move (selectSnapshotsToMovealso defers a move while the old snapshot is transiently pinned,use_count > 1). The move is eventually consistent, so the fix polls for the count to settle to 1 (up to 60s), mirroring theassert_single_local_loghelper already used bytest_logs_with_disks.2) Transient
tmp_marker counted as a snapshot (assert set(local_snapshot_files) == set(previous_snapshot_files)).This is now the dominant variant and also hits master.
moveFileBetweenDisks(KeeperCommon.cpp) writes an emptytmp_<name>marker on the target disk during a move and removes it once the copy completes. The test'slist_s3_objects/get_local_fileshelpers listed everything, so a listing taken while a move was in flight captured the transienttmp_snapshot_<idx>marker intoprevious_snapshot_files. After the next restart only the realsnapshot_<idx>remained, so the set comparison failed with a lonetmp_-prefixed entry on the expected side. The server itself never treatstmp_-prefixed entries as snapshots/logs (it skips them when scanning disks), so the helpers now exclude them too; the disk-cleanup path still wipes them (exclude_tmp=False).Validation: for (1), injecting an 8s delay before the move made the bare assertion fail 3/3 and the polling version pass 3/3. For (2), injecting a transient
tmp_marker at the listing point made the raw listing fail with the exact CI signature ({snapshot_34}vs{snapshot_34, tmp_snapshot_34}) and thetmp_-excluding listing pass. The full test passed 6/6 (both tests, random order) on a clean build.CI failure (variant 2, master): https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=b3461b9116ed78b84a6c37fa846134523a46e21f&name_0=MasterCI&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%202%2F4%29
CI failure (variant 1): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108458&sha=c79ebbf375b04bea319369ed0efff2d275aae491&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%202%2F4%29
Version info
26.7.1.175