Increase restart timeout in test_keeper_snapshot_chunked_transfer recover tests#108557
Conversation
|
cc @antonio2368 — could you review this? Test-only change: the |
|
Workflow [PR], commit [3018ef9] Summary: ✅ AI ReviewSummary
Final Verdict
|
CI finish ledger — 1902388Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only
The flaky-check ran this PR's changed tests many times and they passed every iteration; the job died from the docker memory cgroup OOM-killer (dmesg confirms), not a test failure or budget timeout. Finish Workflow + Style check pass on this head. Session id: cron:our-pr-ci-monitor:20260626-170000 |
Owner update for the
|
|
Pre-PR validation gate for commit Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-4:20260626-172700 |
|
cc @Algunenano — could you take a look at the second commit? It marks |
| ), | ||
| TC("test_random_inserts/", False, "standard replicated inserts test; cluster is fully isolated"), | ||
| TC("test_server_overload/", True, "uses taskset to pin ClickHouse to specific CPU cores; sensitive to concurrent CPU load"), | ||
| TC("test_keeper_snapshot_chunked_transfer/", True, "18-node Keeper+S3 cluster; concurrent --dist=each copies OOM the ASAN runner"), |
There was a problem hiding this comment.
Should this be something like
TC(
"test_keeper_snapshot_chunked_transfer/",
False,
"18-node Keeper+S3 cluster; concurrent --dist=each copies OOM the ASAN runner",
dist_each_sequential=True,
),
There was a problem hiding this comment.
Done in 10b5ee3 — added dist_each_sequential to TC and used exactly that form. It routes the suite sequential only under --dist=each (flaky/targeted); regular --dist=loadfile jobs keep it parallel. Verified locally: keeper -> sequential under flaky/targeted, parallel under regular/llvm_coverage; is_sequential=True configs unchanged.
There was a problem hiding this comment.
Rebased onto master (was conflicting). Master #107985 independently added the same dist_each_sequential field + force_heavy_modules_sequential routing, so I dropped my framework code and the branch now adds only the one keeper TC(...) entry in exactly the form above. Master's regression suite ci/tests/test_force_heavy_modules_sequential.py passes (8/8, incl. the prefix-matches-real-path guard); routing reverified: keeper sequential under --dist=each, parallel under --dist=loadfile. Diff is now 3 files +25/-7.
…over tests The test_recover_* subtests SIGKILL a follower and restart it; recovery means replaying raft logs and installing a snapshot (over S3 with a 1 KiB read buffer) before the server accepts connections. Under msan/tsan this legitimately took ~95 s in CI (server.log: 'Ready for connections' ~95 s after launch, no crash or sanitizer report), but start_clickhouse(20) waited only 20 s and raised 'Cannot start ClickHouse'. Replace the magic 20 with a named RESTART_TIMEOUT_SECONDS = 180 in both test.py and test_concurrent.py. start_clickhouse returns as soon as the server is ready, so the larger upper bound costs nothing on fast runs and only stops the test failing when recovery legitimately takes 60-120 s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mark the suite dist_each_sequential=True. The flaky/targeted integration checks run the parallel bucket with --dist=each, so every xdist worker brings up its own full 18-node Keeper+S3 cluster of test.py concurrently (3 copies on the 61 GiB AMD_MEDIUM ASan runner), which the kernel OOM-killed (job exit 137). The new flag routes both test.py and test_concurrent.py to the looped sequential -n 1 bucket under --dist=each (one cluster at a time, repeated >=3x, so the flakiness signal is preserved) while leaving them parallel under --dist=loadfile in the ~140 normal integration jobs (one file -> one worker -> one cluster, which never OOMs). The dist_each_sequential field and force_heavy_modules_sequential routing landed in master via ClickHouse#107985; this only adds the keeper suite to TEST_CONFIGS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
10b5ee3 to
3b92000
Compare
Pick up revert e531152 of PR ClickHouse#105591 (vector search rescoring exact row-positioning) which caused 02354_vector_search_rescoring to fail with ILLEGAL_COLUMN on _distance. Our branch was rebased onto the pre-revert master pin d562138.
CI finish ledger — 3b92000CI fully finished on this SHA (Finish Workflow=pass; only
This PR touches only the keeper-snapshot integration tests + the integration-test CI config; it cannot affect a vector-search stateless test. The failure was the brief 2026-06-29 master spike from #105591 (merged 11:10 UTC, reverted 12:59 UTC, 46 master hits + 18 unrelated PRs in the 2h window) which our pre-revert rebase base captured. Session id: cron:our-pr-ci-monitor:20260630-123000 |
LLVM Coverage ReportChanged lines: No C/C++ source files changed — skipping uncovered code analysis. Newly covered by added/modified tests: 48 line(s), 2 function(s) across 12 file(s) · Details Top files
|

Related: #102040
Related: #101969
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
Two related CI fixes for the
test_keeper_snapshot_chunked_transfersuite.1. Restart timeout (test.py + test_concurrent.py).
test_recover_*restart a SIGKILL'd follower and wait for it to come back. Recoverythere means replaying raft logs and installing a snapshot (over S3, with a 1 KiB read
buffer to exercise the chunked loader) before the server accepts connections. Under
msan/tsan that legitimately takes a long time.
In a recent CI run on
Integration tests (amd_msan, 5/6), the recovering node reachedApplication: Ready for connections~95 s after launch (server.log: launched at19:58:40, ready at20:00:15), with no crash and no sanitizer report. But the testcalled
start_clickhouse(20), which waits only 20 s and then raisesCannot start ClickHouse, failing the test. This is a too-tight test-side timeout, nota server bug.
Fix: replace the magic
20with a namedRESTART_TIMEOUT_SECONDS = 180in both files.start_clickhousereturns as soon as the server is ready, so the larger upper boundcosts nothing on fast runs; it only avoids a false failure when recovery legitimately
takes 60-120 s under a sanitizer.
CI report (the msan run that triggered this): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108177&sha=c8966fe52bc3ca169cee306629a05478aaa06057&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%205%2F6%29
2. Run the suite sequentially in flaky/targeted checks (integration_tests_configs.py).
The flaky/targeted integration checks run with
--dist=each -n 3, so all 3 xdist workersrun the same modules concurrently, each with its own isolated Docker cluster.
test.pybrings up an 18-instance module-scoped Keeper+S3/minio cluster (
test_concurrent.pyadds10 more), so under
--dist=eachthe runner holds roughly 3x that many ASAN ClickHouseservers up at once and the docker memory cgroup OOM-killer kills pytest (job exit 137),
even with the raised
MAX_MEM_PER_WORKER_DIST_EACHbudget. This surfaced on this PR's ownIntegration tests (amd_asan_ubsan, flaky)run: the changed tests PASSED 6/6, then the jobwas OOM-killed mid-run.
Fix: mark the suite
is_sequential=TrueinTEST_CONFIGSso flaky/targeted checks routeit through the
-n 1sequential path (one cluster at a time, no OOM). The flaky check stillloops sequential modules
max(3, workers)times, so flakiness coverage is unchanged. Normaljobs use
--dist=loadfileand are unaffected (one module = one cluster per file).CI report (the asan flaky run that OOM-ed): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108557&sha=190238853238c9788c758a3c23ebcf761ccac6f4&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29