Increase restart timeout in test_keeper_snapshot_chunked_transfer recover tests by groeneai · Pull Request #108557 · ClickHouse/ClickHouse · GitHub
Skip to content

Increase restart timeout in test_keeper_snapshot_chunked_transfer recover tests#108557

Merged
alexey-milovidov merged 3 commits into
ClickHouse:masterfrom
groeneai:fix-keeper-snapshot-recover-startup-timeout
Jun 30, 2026
Merged

Increase restart timeout in test_keeper_snapshot_chunked_transfer recover tests#108557
alexey-milovidov merged 3 commits into
ClickHouse:masterfrom
groeneai:fix-keeper-snapshot-recover-startup-timeout

Conversation

@groeneai

@groeneai groeneai commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Related: #102040
Related: #101969

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Two related CI fixes for the test_keeper_snapshot_chunked_transfer suite.

1. Restart timeout (test.py + test_concurrent.py).
test_recover_* restart a SIGKILL'd follower and wait for it to come back. Recovery
there means replaying raft logs and installing a snapshot (over S3, with a 1 KiB read
buffer to exercise the chunked loader) before the server accepts connections. Under
msan/tsan that legitimately takes a long time.

In a recent CI run on Integration tests (amd_msan, 5/6), the recovering node reached
Application: Ready for connections ~95 s after launch (server.log: launched at
19:58:40, ready at 20:00:15), with no crash and no sanitizer report. But the test
called start_clickhouse(20), which waits only 20 s and then raises
Cannot start ClickHouse, failing the test. This is a too-tight test-side timeout, not
a server bug.

Fix: replace the magic 20 with a named RESTART_TIMEOUT_SECONDS = 180 in both files.
start_clickhouse returns as soon as the server is ready, so the larger upper bound
costs nothing on fast runs; it only avoids a false failure when recovery legitimately
takes 60-120 s under a sanitizer.

CI report (the msan run that triggered this): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108177&sha=c8966fe52bc3ca169cee306629a05478aaa06057&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%205%2F6%29

2. Run the suite sequentially in flaky/targeted checks (integration_tests_configs.py).
The flaky/targeted integration checks run with --dist=each -n 3, so all 3 xdist workers
run the same modules concurrently, each with its own isolated Docker cluster. test.py
brings up an 18-instance module-scoped Keeper+S3/minio cluster (test_concurrent.py adds
10 more), so under --dist=each the runner holds roughly 3x that many ASAN ClickHouse
servers up at once and the docker memory cgroup OOM-killer kills pytest (job exit 137),
even with the raised MAX_MEM_PER_WORKER_DIST_EACH budget. This surfaced on this PR's own
Integration tests (amd_asan_ubsan, flaky) run: the changed tests PASSED 6/6, then the job
was OOM-killed mid-run.

Fix: mark the suite is_sequential=True in TEST_CONFIGS so flaky/targeted checks route
it through the -n 1 sequential path (one cluster at a time, no OOM). The flaky check still
loops sequential modules max(3, workers) times, so flakiness coverage is unchanged. Normal
jobs use --dist=loadfile and are unaffected (one module = one cluster per file).

CI report (the asan flaky run that OOM-ed): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108557&sha=190238853238c9788c758a3c23ebcf761ccac6f4&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @antonio2368 — could you review this? Test-only change: the test_keeper_snapshot_chunked_transfer recover tests restart a SIGKILL'd follower whose snapshot recovery legitimately takes ~95 s under msan/tsan, but they waited only start_clickhouse(20) and failed with Cannot start ClickHouse. Replaces the magic 20 with RESTART_TIMEOUT_SECONDS = 180 (an upper bound only; start returns as soon as the server is ready).

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 26, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [3018ef9]

Summary:


AI Review

Summary
  • The PR raises the test_keeper_snapshot_chunked_transfer restart wait from a hardcoded 20 seconds to RESTART_TIMEOUT_SECONDS = 180, and marks the suite dist_each_sequential=True so flaky/targeted --dist=each runs do not start multiple full Keeper+S3 clusters concurrently. I did not find any correctness or CI-routing issue that needs an inline review comment.
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 26, 2026
@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 1902388

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
Integration tests (amd_asan_ubsan, flaky) host-OOM (docker memcg OOM-killer killed pytest/dockerd ~5min in; our changed test_recover_* tests PASSED 6/6, so NOT flaky and NOT PR-caused; this PR is a timeout-bound-only change adding zero memory) a fix task is created (investigating at full effort, fixing-PR link to follow here)
CH Inc sync - CH Inc sync (private, not actionable by us)

The flaky-check ran this PR's changed tests many times and they passed every iteration; the job died from the docker memory cgroup OOM-killer (dmesg confirms), not a test failure or budget timeout. Finish Workflow + Style check pass on this head.

Session id: cron:our-pr-ci-monitor:20260626-170000

@groeneai

Copy link
Copy Markdown
Contributor Author

Owner update for the Integration tests (amd_asan_ubsan, flaky) OOM ledger line

The fix for the flaky-check docker-memcg OOM is now in this PR (commit 081430b14f7), so this PR is its own fixing PR for that ledger line.

Root cause (confirmed from job.log). The flaky/targeted integration checks run with --dist=each -n 3 on a 16 CPU / 61 GB runner, so worker count is min(16//5, 61//20) = 3 and all 3 xdist workers run the same modules concurrently, each with its own isolated Docker cluster. test.py brings up an 18-instance module-scoped Keeper+S3/minio cluster (test_concurrent.py adds 10 more), so under --dist=each the runner holds roughly 3x18 ASAN ClickHouse servers + minio up at once. The docker memory cgroup OOM-killer killed pytest at 13:55:09 (3rd test, test_recover_after_interrupted_transfer[local_disk], exit -9 -> job 137). Our changed test_recover_* tests PASSED 6/6 first, so this is neither a flaky test nor PR-caused.

This is not covered by the already-merged MAX_MEM_PER_WORKER_DIST_EACH=20 budget (that commit is on this branch base, yet 3x20=60 GB on a 61 GB box left no headroom for an 18-instance ASAN cluster).

Fix. Mark the suite is_sequential=True in ci/jobs/scripts/integration_tests_configs.py TEST_CONFIGS. Flaky/targeted checks then route both modules through the -n 1 sequential path (one cluster at a time, no OOM). The flaky check still loops sequential modules max(3, workers) times, so flakiness coverage is unchanged. Normal --dist=loadfile jobs are unaffected (one module = one cluster per file). This mirrors the existing test_server_overload / test_storage_delta/test_azure_cluster.py entries that are sequential for resource-contention reasons.

CI report (the asan flaky run that OOM-ed): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108557&sha=190238853238c9788c758a3c23ebcf761ccac6f4&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29

Session id: cron:clickhouse-worker-slot-4:20260626-172700

@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate for commit 081430b14f7 (the sequential-routing OOM fix).

Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. With 16 CPU / 61 GB the flaky check always computes workers = min(16//5, 61//20) = 3 and runs --dist=each -n 3, so all 3 workers hold the 18-instance test.py cluster up at once. Reproduced on this PR's own asan flaky run (job exit 137, docker memcg OOM-killer in dmesg at 13:55:09).
b Root cause explained? --dist=each makes every xdist worker run the same modules concurrently, each with its own Docker cluster. The suite's test.py is an 18-instance Keeper+S3/minio module-scoped cluster (10 more in test_concurrent.py). 3 concurrent copies = ~54 ASAN ClickHouse servers + minio on a 61 GB box; the docker memory cgroup OOM-killer kills pytest. The already-merged MAX_MEM_PER_WORKER_DIST_EACH=20 budget gives 3x20=60 GB nominal, no headroom for an 18-instance ASAN cluster.
c Fix matches root cause? Yes. The root cause is N-way concurrency under --dist=each. Marking the suite is_sequential=True routes it through the -n 1 sequential path, so only one cluster is up at a time. Directly removes the concurrency that causes the OOM, not a symptom.
d Test intent preserved / new tests added? Yes. No test assertion or behavior is changed; this is a CI runner routing change only. Flaky-check coverage is unchanged: the sequential path still loops the modules max(3, workers) times (same execution count the 3 --dist=each workers gave).
e Both directions demonstrated? Yes (by routing assertion). Before the change get_optimal_test_batch placed both modules in the PARALLEL group (which is what --dist=each OOM-ed); after, a standalone call places both in the SEQUENTIAL group and the PARALLEL group is empty. Existing sequential entries (test_server_overload) are unchanged. python3 -m py_compile passes.
f Fix is general across code paths? Yes. Both modules of the suite (test.py + test_concurrent.py) are covered by the single prefix test_keeper_snapshot_chunked_transfer/. This is the same idiomatic mechanism used for other resource-contention suites (test_server_overload, test_storage_delta/test_azure_cluster.py).
g Fix generalizes across inputs? N/A (CI test-runner routing change, not a code bug).
h Backward compatible? N/A (CI test-runner routing change, no user-facing behavior, no settings/format). Normal --dist=loadfile jobs are unaffected: is_sequential only diverts from --dist=each N-way concurrency, and loadfile already runs one cluster per module file.
i Invariants and contracts preserved? Yes. The parallel_skip_prefixes sanity check (integration_tests_configs.py:332) still holds: in every non-flaky batch tests is the full ./tests/integration/test_*/test*.py glob, and in flaky/targeted mode get_parallel_sequential_tests_to_run calls get_optimal_test_batch with that same full glob, so the new prefix always matches at least one file. Verified the glob contains both modules.

Session id: cron:clickhouse-worker-slot-4:20260626-172700

@clickhouse-gh clickhouse-gh Bot added the manual approve Manual approve required to run CI label Jun 26, 2026
@groeneai

Copy link
Copy Markdown
Contributor Author

cc @Algunenano — could you take a look at the second commit? It marks test_keeper_snapshot_chunked_transfer is_sequential=True in ci/jobs/scripts/integration_tests_configs.py so the flaky/targeted integration checks (--dist=each) stop OOM-killing the runner with 3 concurrent copies of its 18-node Keeper+S3 cluster. Same mechanism as the existing test_server_overload / test_storage_delta/test_azure_cluster.py entries. Normal --dist=loadfile jobs are unaffected.

@antonio2368 antonio2368 self-assigned this Jun 29, 2026
),
TC("test_random_inserts/", False, "standard replicated inserts test; cluster is fully isolated"),
TC("test_server_overload/", True, "uses taskset to pin ClickHouse to specific CPU cores; sensitive to concurrent CPU load"),
TC("test_keeper_snapshot_chunked_transfer/", True, "18-node Keeper+S3 cluster; concurrent --dist=each copies OOM the ASAN runner"),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be something like

  TC(
      "test_keeper_snapshot_chunked_transfer/",
      False,
      "18-node Keeper+S3 cluster; concurrent --dist=each copies OOM the ASAN runner",
      dist_each_sequential=True,
  ),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 10b5ee3 — added dist_each_sequential to TC and used exactly that form. It routes the suite sequential only under --dist=each (flaky/targeted); regular --dist=loadfile jobs keep it parallel. Verified locally: keeper -> sequential under flaky/targeted, parallel under regular/llvm_coverage; is_sequential=True configs unchanged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased onto master (was conflicting). Master #107985 independently added the same dist_each_sequential field + force_heavy_modules_sequential routing, so I dropped my framework code and the branch now adds only the one keeper TC(...) entry in exactly the form above. Master's regression suite ci/tests/test_force_heavy_modules_sequential.py passes (8/8, incl. the prefix-matches-real-path guard); routing reverified: keeper sequential under --dist=each, parallel under --dist=loadfile. Diff is now 3 files +25/-7.

groeneai and others added 2 commits June 29, 2026 11:39
…over tests

The test_recover_* subtests SIGKILL a follower and restart it; recovery means
replaying raft logs and installing a snapshot (over S3 with a 1 KiB read buffer)
before the server accepts connections. Under msan/tsan this legitimately took
~95 s in CI (server.log: 'Ready for connections' ~95 s after launch, no crash or
sanitizer report), but start_clickhouse(20) waited only 20 s and raised
'Cannot start ClickHouse'.

Replace the magic 20 with a named RESTART_TIMEOUT_SECONDS = 180 in both test.py
and test_concurrent.py. start_clickhouse returns as soon as the server is ready,
so the larger upper bound costs nothing on fast runs and only stops the test
failing when recovery legitimately takes 60-120 s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mark the suite dist_each_sequential=True. The flaky/targeted integration checks
run the parallel bucket with --dist=each, so every xdist worker brings up its own
full 18-node Keeper+S3 cluster of test.py concurrently (3 copies on the 61 GiB
AMD_MEDIUM ASan runner), which the kernel OOM-killed (job exit 137). The new flag
routes both test.py and test_concurrent.py to the looped sequential -n 1 bucket
under --dist=each (one cluster at a time, repeated >=3x, so the flakiness signal
is preserved) while leaving them parallel under --dist=loadfile in the ~140 normal
integration jobs (one file -> one worker -> one cluster, which never OOMs).

The dist_each_sequential field and force_heavy_modules_sequential routing landed
in master via ClickHouse#107985; this only adds the keeper suite to TEST_CONFIGS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the fix-keeper-snapshot-recover-startup-timeout branch from 10b5ee3 to 3b92000 Compare June 29, 2026 11:42
Pick up revert e531152 of PR ClickHouse#105591 (vector search rescoring exact
row-positioning) which caused 02354_vector_search_rescoring to fail with
ILLEGAL_COLUMN on _distance. Our branch was rebased onto the pre-revert
master pin d562138.
@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 3b92000

CI fully finished on this SHA (Finish Workflow=pass; only CH Inc sync still pending, which is exempt). Every failure below has an owner.

Check / test Reason Owner / fixing PR
Stateless tests (amd_llvm_coverage, 3/3 + AsyncInsert s3 + old-analyzer s3 DatabaseReplicated WasmEdge) / 02354_vector_search_rescoring trunk regression (ILLEGAL_COLUMN on _distance with vector_search_with_rescoring=0, useVectorSearch.cpp:237) Fixing commit e531152b092 (revert of #105591, merged on master). Our branch was rebased onto the pre-revert pin d56213874eb; merged master @ 3018ef99f08 to pick up the revert.
CH Inc sync CH Inc sync (private, not actionable by us)

This PR touches only the keeper-snapshot integration tests + the integration-test CI config; it cannot affect a vector-search stateless test. The failure was the brief 2026-06-29 master spike from #105591 (merged 11:10 UTC, reverted 12:59 UTC, 46 master hits + 18 unrelated PRs in the 2h window) which our pre-revert rebase base captured.

Session id: cron:our-pr-ci-monitor:20260630-123000

@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.74% 85.74% +0.01%
Functions 91.47% 91.47% +0.00%
Branches 78.34% 78.36% +0.02%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 48 line(s), 2 function(s) across 12 file(s) · Details

Top files
  • src/Backups/BackupEntriesCollector.cpp: 25 line(s), 2 function(s)
  • src/Processors/Sources/MySQLSource.cpp: 5 line(s)
  • src/Storages/ConstraintsDescription.cpp: 5 line(s)
  • src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.cpp: 3 line(s)
  • base/base/chrono_io.h: 2 line(s)

Full report

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 30, 2026
Merged via the queue into ClickHouse:master with commit 7c98f89 Jun 30, 2026
174 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors manual approve Manual approve required to run CI pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants