iframe-proxy

groeneai · 2026-06-25T23:00:48Z

Related: #102040
Related: #101969

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Two related CI fixes for the test_keeper_snapshot_chunked_transfer suite.

1. Restart timeout (test.py + test_concurrent.py).
test_recover_* restart a SIGKILL'd follower and wait for it to come back. Recovery
there means replaying raft logs and installing a snapshot (over S3, with a 1 KiB read
buffer to exercise the chunked loader) before the server accepts connections. Under
msan/tsan that legitimately takes a long time.

In a recent CI run on Integration tests (amd_msan, 5/6), the recovering node reached
Application: Ready for connections ~95 s after launch (server.log: launched at
19:58:40, ready at 20:00:15), with no crash and no sanitizer report. But the test
called start_clickhouse(20), which waits only 20 s and then raises
Cannot start ClickHouse, failing the test. This is a too-tight test-side timeout, not
a server bug.

Fix: replace the magic 20 with a named RESTART_TIMEOUT_SECONDS = 180 in both files.
start_clickhouse returns as soon as the server is ready, so the larger upper bound
costs nothing on fast runs; it only avoids a false failure when recovery legitimately
takes 60-120 s under a sanitizer.

CI report (the msan run that triggered this): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108177&sha=c8966fe52bc3ca169cee306629a05478aaa06057&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%205%2F6%29

2. Run the suite sequentially in flaky/targeted checks (integration_tests_configs.py).
The flaky/targeted integration checks run with --dist=each -n 3, so all 3 xdist workers
run the same modules concurrently, each with its own isolated Docker cluster. test.py
brings up an 18-instance module-scoped Keeper+S3/minio cluster (test_concurrent.py adds
10 more), so under --dist=each the runner holds roughly 3x that many ASAN ClickHouse
servers up at once and the docker memory cgroup OOM-killer kills pytest (job exit 137),
even with the raised MAX_MEM_PER_WORKER_DIST_EACH budget. This surfaced on this PR's own
Integration tests (amd_asan_ubsan, flaky) run: the changed tests PASSED 6/6, then the job
was OOM-killed mid-run.

Fix: mark the suite is_sequential=True in TEST_CONFIGS so flaky/targeted checks route
it through the -n 1 sequential path (one cluster at a time, no OOM). The flaky check still
loops sequential modules max(3, workers) times, so flakiness coverage is unchanged. Normal
jobs use --dist=loadfile and are unaffected (one module = one cluster per file).

CI report (the asan flaky run that OOM-ed): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108557&sha=190238853238c9788c758a3c23ebcf761ccac6f4&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29

groeneai · 2026-06-25T23:01:09Z

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. Via the real integration harness with an injected ~8 s startup delay, `start_clickhouse(2)` deterministically raised `Cannot start ClickHouse` after 2.3 s (the exact CI signature); the triggering CI run showed the recovering node ready ~95 s after launch versus a 20 s wait.
b	Root cause explained?	Yes. The `test_recover_*` subtests SIGKILL a follower and restart it. Recovery = raft-log replay + snapshot install over S3 (1 KiB read buffer) + loading accumulated `system.text_log` parts, all before the server accepts connections. Under msan/tsan this legitimately took ~95 s, but `start_clickhouse(20)` waits only 20 s, so `wait_start` exhausts its budget and raises `Cannot start ClickHouse`. No crash / sanitizer report in the node log.
c	Fix matches root cause?	Yes. The failure is a too-tight test-side wait, so the wait is raised to `RESTART_TIMEOUT_SECONDS = 180`. `start_clickhouse` returns the instant the server is ready, so the larger value is an upper bound only. No assertion or data weakened.
d	Test intent preserved / new tests added?	Yes. Zero assertions changed; only the startup wait ceiling increased. The tests still verify snapshot-recovery correctness identically.
e	Both directions demonstrated?	Yes. With the injected delay: `start_clickhouse(2)` raised `Cannot start ClickHouse` after 2.3 s; `start_clickhouse(180)` succeeded after 6.2 s. Separately, the real `test_recover_from_snapshot_sent_by_old_leader[local_disk]` subtest passed green at 180 s (61.87 s) on a debug binary through the praktika integration runner.
f	Fix is general across code paths?	N/A (test timeout fix, not a code bug). All 6 `start_clickhouse(20)` calls in test.py and the 1 in test_concurrent.py were updated to the shared constant — no remaining magic 20.
g	Fix generalizes across inputs?	N/A (test timeout fix, not a code bug).
h	Backward compatible?	N/A (test-only change, no server behavior or format changed).
i	Invariants and contracts preserved?	N/A (test-only change). `start_clickhouse(n)` semantics unchanged — returns when ready, `n` is just the upper bound.

Session id: cron:clickhouse-worker-slot-3:20260625-221100

groeneai · 2026-06-25T23:01:36Z

cc @antonio2368 — could you review this? Test-only change: the test_keeper_snapshot_chunked_transfer recover tests restart a SIGKILL'd follower whose snapshot recovery legitimately takes ~95 s under msan/tsan, but they waited only start_clickhouse(20) and failed with Cannot start ClickHouse. Replaces the magic 20 with RESTART_TIMEOUT_SECONDS = 180 (an upper bound only; start returns as soon as the server is ready).

clickhouse-gh · 2026-06-26T13:25:33Z

Workflow [PR], commit [3018ef9]

Summary: ✅

AI Review

Summary

The PR raises the test_keeper_snapshot_chunked_transfer restart wait from a hardcoded 20 seconds to RESTART_TIMEOUT_SECONDS = 180, and marks the suite dist_each_sequential=True so flaky/targeted --dist=each runs do not start multiple full Keeper+S3 clusters concurrently. I did not find any correctness or CI-routing issue that needs an inline review comment.

Final Verdict

Status: ✅ Approve

groeneai · 2026-06-26T17:26:02Z

CI finish ledger — `1902388`

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test	Reason	Owner / fixing PR
Integration tests (amd_asan_ubsan, flaky)	host-OOM (docker memcg OOM-killer killed pytest/dockerd ~5min in; our changed `test_recover_*` tests PASSED 6/6, so NOT flaky and NOT PR-caused; this PR is a timeout-bound-only change adding zero memory)	a fix task is created (investigating at full effort, fixing-PR link to follow here)
CH Inc sync	-	CH Inc sync (private, not actionable by us)

The flaky-check ran this PR's changed tests many times and they passed every iteration; the job died from the docker memory cgroup OOM-killer (dmesg confirms), not a test failure or budget timeout. Finish Workflow + Style check pass on this head.

Session id: cron:our-pr-ci-monitor:20260626-170000

groeneai · 2026-06-26T17:55:34Z

Owner update for the `Integration tests (amd_asan_ubsan, flaky)` OOM ledger line

The fix for the flaky-check docker-memcg OOM is now in this PR (commit 081430b14f7), so this PR is its own fixing PR for that ledger line.

Root cause (confirmed from job.log). The flaky/targeted integration checks run with --dist=each -n 3 on a 16 CPU / 61 GB runner, so worker count is min(16//5, 61//20) = 3 and all 3 xdist workers run the same modules concurrently, each with its own isolated Docker cluster. test.py brings up an 18-instance module-scoped Keeper+S3/minio cluster (test_concurrent.py adds 10 more), so under --dist=each the runner holds roughly 3x18 ASAN ClickHouse servers + minio up at once. The docker memory cgroup OOM-killer killed pytest at 13:55:09 (3rd test, test_recover_after_interrupted_transfer[local_disk], exit -9 -> job 137). Our changed test_recover_* tests PASSED 6/6 first, so this is neither a flaky test nor PR-caused.

This is not covered by the already-merged MAX_MEM_PER_WORKER_DIST_EACH=20 budget (that commit is on this branch base, yet 3x20=60 GB on a 61 GB box left no headroom for an 18-instance ASAN cluster).

Fix. Mark the suite is_sequential=True in ci/jobs/scripts/integration_tests_configs.py TEST_CONFIGS. Flaky/targeted checks then route both modules through the -n 1 sequential path (one cluster at a time, no OOM). The flaky check still loops sequential modules max(3, workers) times, so flakiness coverage is unchanged. Normal --dist=loadfile jobs are unaffected (one module = one cluster per file). This mirrors the existing test_server_overload / test_storage_delta/test_azure_cluster.py entries that are sequential for resource-contention reasons.

CI report (the asan flaky run that OOM-ed): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108557&sha=190238853238c9788c758a3c23ebcf761ccac6f4&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29

Session id: cron:clickhouse-worker-slot-4:20260626-172700

groeneai · 2026-06-26T17:56:01Z

Pre-PR validation gate for commit 081430b14f7 (the sequential-routing OOM fix).

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. With 16 CPU / 61 GB the flaky check always computes `workers = min(16//5, 61//20) = 3` and runs `--dist=each -n 3`, so all 3 workers hold the 18-instance `test.py` cluster up at once. Reproduced on this PR's own asan flaky run (job exit 137, docker memcg OOM-killer in dmesg at 13:55:09).
b	Root cause explained?	`--dist=each` makes every xdist worker run the same modules concurrently, each with its own Docker cluster. The suite's `test.py` is an 18-instance Keeper+S3/minio module-scoped cluster (10 more in `test_concurrent.py`). 3 concurrent copies = ~54 ASAN ClickHouse servers + minio on a 61 GB box; the docker memory cgroup OOM-killer kills pytest. The already-merged `MAX_MEM_PER_WORKER_DIST_EACH=20` budget gives 3x20=60 GB nominal, no headroom for an 18-instance ASAN cluster.
c	Fix matches root cause?	Yes. The root cause is N-way concurrency under `--dist=each`. Marking the suite `is_sequential=True` routes it through the `-n 1` sequential path, so only one cluster is up at a time. Directly removes the concurrency that causes the OOM, not a symptom.
d	Test intent preserved / new tests added?	Yes. No test assertion or behavior is changed; this is a CI runner routing change only. Flaky-check coverage is unchanged: the sequential path still loops the modules `max(3, workers)` times (same execution count the 3 `--dist=each` workers gave).
e	Both directions demonstrated?	Yes (by routing assertion). Before the change `get_optimal_test_batch` placed both modules in the PARALLEL group (which is what `--dist=each` OOM-ed); after, a standalone call places both in the SEQUENTIAL group and the PARALLEL group is empty. Existing sequential entries (`test_server_overload`) are unchanged. `python3 -m py_compile` passes.
f	Fix is general across code paths?	Yes. Both modules of the suite (`test.py` + `test_concurrent.py`) are covered by the single prefix `test_keeper_snapshot_chunked_transfer/`. This is the same idiomatic mechanism used for other resource-contention suites (`test_server_overload`, `test_storage_delta/test_azure_cluster.py`).
g	Fix generalizes across inputs?	N/A (CI test-runner routing change, not a code bug).
h	Backward compatible?	N/A (CI test-runner routing change, no user-facing behavior, no settings/format). Normal `--dist=loadfile` jobs are unaffected: `is_sequential` only diverts from `--dist=each` N-way concurrency, and loadfile already runs one cluster per module file.
i	Invariants and contracts preserved?	Yes. The `parallel_skip_prefixes` sanity check (`integration_tests_configs.py:332`) still holds: in every non-flaky batch `tests` is the full `./tests/integration/test_/test.py` glob, and in flaky/targeted mode `get_parallel_sequential_tests_to_run` calls `get_optimal_test_batch` with that same full glob, so the new prefix always matches at least one file. Verified the glob contains both modules.

Session id: cron:clickhouse-worker-slot-4:20260626-172700

groeneai · 2026-06-26T17:56:32Z

cc @Algunenano — could you take a look at the second commit? It marks test_keeper_snapshot_chunked_transfer is_sequential=True in ci/jobs/scripts/integration_tests_configs.py so the flaky/targeted integration checks (--dist=each) stop OOM-killing the runner with 3 concurrent copies of its 18-node Keeper+S3 cluster. Same mechanism as the existing test_server_overload / test_storage_delta/test_azure_cluster.py entries. Normal --dist=loadfile jobs are unaffected.

antonio2368 · 2026-06-29T07:51:59Z

    ),
    TC("test_random_inserts/", False, "standard replicated inserts test; cluster is fully isolated"),
    TC("test_server_overload/", True, "uses taskset to pin ClickHouse to specific CPU cores; sensitive to concurrent CPU load"),
+    TC("test_keeper_snapshot_chunked_transfer/", True, "18-node Keeper+S3 cluster; concurrent --dist=each copies OOM the ASAN runner"),


Should this be something like

TC( "test_keeper_snapshot_chunked_transfer/", False, "18-node Keeper+S3 cluster; concurrent --dist=each copies OOM the ASAN runner", dist_each_sequential=True, ),

Done in 10b5ee3 — added dist_each_sequential to TC and used exactly that form. It routes the suite sequential only under --dist=each (flaky/targeted); regular --dist=loadfile jobs keep it parallel. Verified locally: keeper -> sequential under flaky/targeted, parallel under regular/llvm_coverage; is_sequential=True configs unchanged.

Rebased onto master (was conflicting). Master #107985 independently added the same dist_each_sequential field + force_heavy_modules_sequential routing, so I dropped my framework code and the branch now adds only the one keeper TC(...) entry in exactly the form above. Master's regression suite ci/tests/test_force_heavy_modules_sequential.py passes (8/8, incl. the prefix-matches-real-path guard); routing reverified: keeper sequential under --dist=each, parallel under --dist=loadfile. Diff is now 3 files +25/-7.

…over tests The test_recover_* subtests SIGKILL a follower and restart it; recovery means replaying raft logs and installing a snapshot (over S3 with a 1 KiB read buffer) before the server accepts connections. Under msan/tsan this legitimately took ~95 s in CI (server.log: 'Ready for connections' ~95 s after launch, no crash or sanitizer report), but start_clickhouse(20) waited only 20 s and raised 'Cannot start ClickHouse'. Replace the magic 20 with a named RESTART_TIMEOUT_SECONDS = 180 in both test.py and test_concurrent.py. start_clickhouse returns as soon as the server is ready, so the larger upper bound costs nothing on fast runs and only stops the test failing when recovery legitimately takes 60-120 s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Mark the suite dist_each_sequential=True. The flaky/targeted integration checks run the parallel bucket with --dist=each, so every xdist worker brings up its own full 18-node Keeper+S3 cluster of test.py concurrently (3 copies on the 61 GiB AMD_MEDIUM ASan runner), which the kernel OOM-killed (job exit 137). The new flag routes both test.py and test_concurrent.py to the looped sequential -n 1 bucket under --dist=each (one cluster at a time, repeated >=3x, so the flakiness signal is preserved) while leaving them parallel under --dist=loadfile in the ~140 normal integration jobs (one file -> one worker -> one cluster, which never OOMs). The dist_each_sequential field and force_heavy_modules_sequential routing landed in master via ClickHouse#107985; this only adds the keeper suite to TEST_CONFIGS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Pick up revert e531152 of PR ClickHouse#105591 (vector search rescoring exact row-positioning) which caused 02354_vector_search_rescoring to fail with ILLEGAL_COLUMN on _distance. Our branch was rebased onto the pre-revert master pin d562138.

groeneai · 2026-06-30T12:49:47Z

CI finish ledger — `3b92000`

CI fully finished on this SHA (Finish Workflow=pass; only CH Inc sync still pending, which is exempt). Every failure below has an owner.

Check / test	Reason	Owner / fixing PR
Stateless tests (amd_llvm_coverage, 3/3 + AsyncInsert s3 + old-analyzer s3 DatabaseReplicated WasmEdge) / `02354_vector_search_rescoring`	trunk regression (`ILLEGAL_COLUMN` on `_distance` with `vector_search_with_rescoring=0`, `useVectorSearch.cpp:237`)	Fixing commit `e531152b092` (revert of #105591, merged on master). Our branch was rebased onto the pre-revert pin `d56213874eb`; merged master @ `3018ef99f08` to pick up the revert.
CH Inc sync	—	CH Inc sync (private, not actionable by us)

This PR touches only the keeper-snapshot integration tests + the integration-test CI config; it cannot affect a vector-search stateless test. The failure was the brief 2026-06-29 master spike from #105591 (merged 11:10 UTC, reverted 12:59 UTC, 46 master hits + 18 unrelated PRs in the 2h window) which our pre-revert rebase base captured.

Session id: cron:our-pr-ci-monitor:20260630-123000

clickhouse-gh · 2026-06-30T15:19:37Z

LLVM Coverage Report

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 48 line(s), 2 function(s) across 12 file(s) · Details

Top files

src/Backups/BackupEntriesCollector.cpp: 25 line(s), 2 function(s)
src/Processors/Sources/MySQLSource.cpp: 5 line(s)
src/Storages/ConstraintsDescription.cpp: 5 line(s)
src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.cpp: 3 line(s)
base/base/chrono_io.h: 2 line(s)

Full report

groeneai mentioned this pull request Jun 25, 2026

Speed up default trimLeft when no row needs trimming #108177

Merged

nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 26, 2026

clickhouse-gh Bot added the pr-ci label Jun 26, 2026

clickhouse-gh Bot added the manual approve Manual approve required to run CI label Jun 26, 2026

antonio2368 self-assigned this Jun 29, 2026

antonio2368 requested changes Jun 29, 2026

View reviewed changes

groeneai and others added 2 commits June 29, 2026 11:39

groeneai force-pushed the fix-keeper-snapshot-recover-startup-timeout branch from 10b5ee3 to 3b92000 Compare June 29, 2026 11:42

antonio2368 approved these changes Jun 30, 2026

View reviewed changes

alexey-milovidov added this pull request to the merge queue Jun 30, 2026

Merged via the queue into ClickHouse:master with commit 7c98f89 Jun 30, 2026
174 checks passed

robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 30, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

clickhouse-gh Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

groeneai commented Jun 26, 2026

CI finish ledger — 1902388

Uh oh!

groeneai commented Jun 26, 2026

Owner update for the Integration tests (amd_asan_ubsan, flaky) OOM ledger line

Uh oh!

groeneai commented Jun 26, 2026

Uh oh!

groeneai commented Jun 26, 2026

Uh oh!

antonio2368 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai commented Jun 30, 2026

CI finish ledger — 3b92000

Uh oh!

clickhouse-gh Bot commented Jun 30, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

groeneai commented Jun 25, 2026 •

edited

Loading

clickhouse-gh Bot commented Jun 26, 2026 •

edited

Loading

CI finish ledger — `1902388`

Owner update for the `Integration tests (amd_asan_ubsan, flaky)` OOM ledger line

CI finish ledger — `3b92000`