Fix flaky test_keeper_disks::test_snapshots_with_disks by groeneai · Pull Request #108513 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix flaky test_keeper_disks::test_snapshots_with_disks#108513

Merged
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-flaky-keeper-disks-snapshots
Jun 27, 2026
Merged

Fix flaky test_keeper_disks::test_snapshots_with_disks#108513
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-flaky-keeper-disks-snapshots

Conversation

@groeneai

@groeneai groeneai commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Fixes flaky test_keeper_disks/test.py::test_snapshots_with_disks. Two distinct races, same area (cross-disk snapshot moves); both are test-only issues, no server behavior is changed.

1) Local snapshot count race (assert len(local_snapshot_files) == 1 -> assert 2 == 1).
When a separate latest_snapshot_storage_disk is configured, creating a snapshot relocates the previous (non-latest) snapshot off the local "latest" disk. In KeeperStateMachine::create_snapshot the Created persistent snapshot {idx} log line is emitted in the under-lock publish step (publishWrittenSnapshot), while the actual move runs afterwards in the deferred, outside-the-lock maintenance step (runSnapshotMaintenance -> moveSnapshotCandidate). The test waited on that log line and checked the local count once, racing the still-pending move (selectSnapshotsToMove also defers a move while the old snapshot is transiently pinned, use_count > 1). The move is eventually consistent, so the fix polls for the count to settle to 1 (up to 60s), mirroring the assert_single_local_log helper already used by test_logs_with_disks.

2) Transient tmp_ marker counted as a snapshot (assert set(local_snapshot_files) == set(previous_snapshot_files)).
This is now the dominant variant and also hits master. moveFileBetweenDisks (KeeperCommon.cpp) writes an empty tmp_<name> marker on the target disk during a move and removes it once the copy completes. The test's list_s3_objects / get_local_files helpers listed everything, so a listing taken while a move was in flight captured the transient tmp_snapshot_<idx> marker into previous_snapshot_files. After the next restart only the real snapshot_<idx> remained, so the set comparison failed with a lone tmp_-prefixed entry on the expected side. The server itself never treats tmp_-prefixed entries as snapshots/logs (it skips them when scanning disks), so the helpers now exclude them too; the disk-cleanup path still wipes them (exclude_tmp=False).

Validation: for (1), injecting an 8s delay before the move made the bare assertion fail 3/3 and the polling version pass 3/3. For (2), injecting a transient tmp_ marker at the listing point made the raw listing fail with the exact CI signature ({snapshot_34} vs {snapshot_34, tmp_snapshot_34}) and the tmp_-excluding listing pass. The full test passed 6/6 (both tests, random order) on a clean build.

CI failure (variant 2, master): https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=b3461b9116ed78b84a6c37fa846134523a46e21f&name_0=MasterCI&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%202%2F4%29
CI failure (variant 1): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108458&sha=c79ebbf375b04bea319369ed0efff2d275aae491&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%202%2F4%29

Version info

  • Merged into: 26.7.1.175

The test asserted len(local_snapshot_files) == 1 immediately after the
"Created persistent snapshot {idx}" log line and intermittently observed 2
(assert 2 == 1).

When a separate latest_snapshot_storage_disk is configured, creating a
snapshot relocates the previous (non-latest) snapshot off the local "latest"
disk to the main disk. In KeeperStateMachine::create_snapshot that log line is
emitted in the under-lock publish step (publishWrittenSnapshot), while the move
runs afterwards in the deferred, outside-the-lock maintenance step
(runSnapshotMaintenance -> moveSnapshotCandidate). selectSnapshotsToMove also
defers a move while the old snapshot is transiently pinned (use_count > 1).
Checking the count once therefore races the still-pending move.

The move is eventually consistent, so poll for the local count to settle to 1
(up to 60s) instead of asserting once, mirroring the assert_single_local_log
helper already used by test_logs_with_disks. Test-only change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @Michicosun — could you review this? Test-only fix: test_keeper_disks::test_snapshots_with_disks checked the local snapshot count once right after the "Created persistent snapshot" log line, but the previous snapshot is relocated off the local disk later in the deferred runSnapshotMaintenance step, so it raced and saw 2 files. The fix polls for the count to settle (mirrors assert_single_local_log already used by the sibling test_logs_with_disks).

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jun 25, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [8941881]

Summary:


AI Review

Summary

This PR fixes test_keeper_disks::test_snapshots_with_disks by making the snapshot assertions wait for Keeper's deferred cross-disk maintenance and by filtering transient tmp_ marker files from the test helpers while still deleting them during cleanup. The changed test behavior matches the production snapshot/log scan semantics and the moveFileBetweenDisks ordering; I did not find a correctness issue.

Final Verdict

Status: ✅ Approve

…ts/logs

The set-equality check in test_snapshots_with_disks (and the count checks) listed
local and S3 files without filtering out the empty "tmp_<name>" markers that Keeper
writes on the target disk while moving a snapshot/log between disks (moveFileBetweenDisks
in KeeperCommon.cpp). When a listing was taken while a move was in flight, the transient
marker was captured into previous_snapshot_files; after the next restart the real file
remained, so set(local) != set(previous) failed with a lone tmp_-prefixed entry on the
expected side. The server itself never treats tmp_-prefixed entries as snapshots/logs
(it skips them when scanning disks), so the test should not either. Exclude them from
get_local_files and list_s3_objects; cleanup still wipes them via exclude_tmp=False.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Follow-up commit (8941881) extends this PR to also fix the set-equality variant of the same flaky test, which is now the dominant failure mode and also hits master.

Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. Injecting a transient tmp_snapshot_<idx> marker on S3 at the listing point reproduces the exact failure every run. Validated via a harness test against a real Keeper server + MinIO.
b Root cause explained? Yes. moveFileBetweenDisks (KeeperCommon.cpp) writes an empty tmp_<name> marker on the target disk during a cross-disk move and removes it after the copy. A listing taken during an in-flight move captured tmp_snapshot_<idx> into previous_snapshot_files; after the next restart only real snapshot_<idx> remained, so set(local) == set(previous) failed with a lone tmp_-prefixed entry on the expected side.
c Fix matches root cause? Yes. The helpers now exclude tmp_-prefixed entries, exactly as the server does when scanning disks (it never treats tmp_ files as snapshots/logs). Not a band-aid: the markers are genuinely not snapshot/log files.
d Test intent preserved / new tests added? Yes. The set/count assertions still verify the snapshot relocation behavior; only spurious internal markers are dropped. No assertion weakened. (Harness tests used for validation were not committed.)
e Both directions demonstrated? Yes. Raw listing -> FAIL with the exact CI signature ({snapshot_34} vs {snapshot_34, tmp_snapshot_34}); tmp_-excluding listing -> PASS. Full test 6/6 (both tests, random order) on a clean build.
f Fix is general across code paths? Yes. Applied to both get_local_files (local ls, covers snapshots and logs) and list_s3_objects (S3). The cleanup path keeps exclude_tmp=False so it still wipes markers.
g Fix generalizes across inputs? N/A (test-only listing-filter fix, not a code bug).
h Backward compatible? N/A (test-only change; no server behavior, settings, or formats changed).
i Invariants and contracts preserved? N/A (test-only change). The helpers' contract is now "return real snapshot/log files", matching the server's own disk-scan semantics.

Session id: cron:clickhouse-worker-slot-1:20260627-061200

@clickhouse-gh

clickhouse-gh Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.30% 85.40% +0.10%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 730 line(s), 91 function(s) across 144 file(s) · Details

Top files
  • src/Common/Dwarf.cpp: 205 line(s), 11 function(s)
  • src/Common/SignalHandlers.cpp: 105 line(s), 4 function(s)
  • src/Backups/BackupEntriesCollector.cpp: 27 line(s), 2 function(s)
  • src/AggregateFunctions/TimeSeries/AggregateFunctionTimeseriesChanges.h: 14 line(s), 32 function(s)
  • src/Common/StackTrace.cpp: 12 line(s), 5 function(s)

Full report

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 8941881

CI fully finished on this head (Finish Workflow + Mergeable Check + Config Workflow SUCCESS; no checks queued/in_progress). Every failure has an owner.

Check / test Reason Owner / fixing PR
CH Inc sync private fork-mirror sync CH Inc sync (private, not actionable by us)

No test FAIL/ERROR rows in CIDB on this commit. The follow-up commit 8941881 (tmp_-marker exclusion in the test listings, on top of the polling fix) ran clean across the full PR matrix incl. the previously-flaky Integration tests (... distributed plan ...) lane.

Session id: cron:our-pr-ci-monitor:20260627-100000

@alexey-milovidov alexey-milovidov self-assigned this Jun 27, 2026
@alexey-milovidov alexey-milovidov merged commit 4dd3b22 into ClickHouse:master Jun 27, 2026
168 of 169 checks passed
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 27, 2026
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 29, 2026
Pick up merged PR ClickHouse#108513 (fix flaky test_keeper_disks::test_snapshots_with_disks)
per @alexey-milovidov request on PR ClickHouse#108568.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants