Fix flaky test_keeper_session_refuse_stale_server #109037
Merged
alexey-milovidov merged 1 commit intoJul 1, 2026
The test restarts all three Keeper containers, then waits for them to come back. stop_zookeeper_nodes() stops them serially, but start_zookeeper_nodes() went through process_integration_nodes(), which ran one `docker compose start <node>` process per node concurrently (ThreadPoolExecutor) against the same compose project.

Concurrent `docker compose` invocations on one project race on the project's shared state: on a loaded CI runner (integration tests run with -n workers), the start of one node can be silently dropped.

On the amd_msan failure the Keeper's graceful stop overran docker's stop timeout and was SIGKILLed (exit 137); the three concurrent `docker compose start` processes then fired within 2 ms of that stop returning, and dockerd received no /start for zoo3 (it did for zoo1/zoo2). zoo3 never came back, so wait_zookeeper_to_start() timed out after 180 s and raised the generic "Cannot wait ZooKeeper container ... iptables-nft" exception. The same latent race affected kill_zookeeper_nodes().

Use a single `docker compose <action> <node1> <node2> ...` invocation instead of N concurrent ones. This is the pattern the framework already uses for startup (a single `docker compose up -d`); compose parallelizes the services internally, so there is no loss of parallelism and no shared-project race.

Verified via the CI report for PR ClickHouse#108074 (amd_msan 5/8): dockerd received zoo3's container /json inspects but no /start, while zoo1/zoo2 got /start; docker.log shows "zoo3-1 exited with code 137" with no subsequent restart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cc @maxknv — could you review this? It fixes a concurrent- |
alexey-milovidov approved these changes Jul 1, 2026

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
Fixes flaky `test_keeper_session_refuse_stale_server` (requested by @alexey-milovidov on #108074).

The test restarts all three Keeper containers, then waits for them to come back. `stop_zookeeper_nodes()` stops them serially, but `start_zookeeper_nodes()` went through `process_integration_nodes()`, which ran one `docker compose start <node>` process per node concurrently (a `ThreadPoolExecutor`) against the same compose project.

Concurrent `docker compose` invocations on one project race on the project's shared state: on a loaded CI runner (integration tests run with `-n` workers) the start of one node can be silently dropped.

On the failing amd_msan run the Keeper's graceful stop overran docker's stop timeout and was SIGKILLed (exit 137); the three concurrent `docker compose start` processes then fired within ~2 ms of that stop returning, and dockerd received no `/start` for zoo3 (it did for zoo1/zoo2). zoo3 never came back, so `wait_zookeeper_to_start()` timed out after 180 s and raised the generic `Cannot wait ZooKeeper container ... iptables-nft` exception at `test.py:73`. The same latent race affected `kill_zookeeper_nodes()`.

Fix: use a single `docker compose <action> <node1> <node2> ...` invocation instead of N concurrent ones. This is the pattern the framework already uses for startup (a single `docker compose up -d`); compose parallelizes the services internally, so there is no loss of parallelism and no shared-project race.

CI failure this addresses (Integration tests (amd_msan, 5/8), PR #108074, commit `817d237610a2d554f27994aedfdd07de1fef0bdc`):
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108074&sha=817d237610a2d554f27994aedfdd07de1fef0bdc&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%205%2F8%29

Evidence from that report: dockerd received zoo3's container `/json` inspects but no `/start` while zoo1/zoo2 got `/start`; `docker.log` shows `zoo3-1 exited with code 137` with no subsequent restart; the gw0 log then polls `zoo3: bad hostname` for the full 180 s.

No related open issue found (searched by test name, `process_integration_nodes`, `start_zookeeper_nodes`).

Version info
26.7.1.370
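
For illustration only, the single-invocation pattern described in this PR could be sketched roughly as follows. This is not the actual diff: `build_compose_command`, `run_compose_action`, the `base_cmd` argument, and the `example_project` name are hypothetical and do not come from the ClickHouse test framework. The sketch only shows one argv being built for all nodes, in place of one `docker compose` process per node.

```python
import subprocess
from typing import List, Sequence

# Actions discussed in this PR; anything else is rejected in this sketch.
ALLOWED_ACTIONS = ("start", "stop", "kill")


def build_compose_command(
    base_cmd: Sequence[str], action: str, nodes: Sequence[str]
) -> List[str]:
    """Build ONE `docker compose <action> <node1> <node2> ...` argv.

    `base_cmd` is a hypothetical prefix such as
    ["docker", "compose", "--project-name", "example_project"].
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if not nodes:
        raise ValueError("at least one node is required")
    return [*base_cmd, action, *nodes]


def run_compose_action(
    base_cmd: Sequence[str],
    action: str,
    nodes: Sequence[str],
    timeout: float = 300.0,  # assumed value, not taken from the framework
) -> None:
    """Run the action for all nodes in a single compose process.

    Because only one process touches the compose project, there is no
    concurrent access to the project's shared state from this caller.
    """
    cmd = build_compose_command(base_cmd, action, nodes)
    subprocess.run(cmd, check=True, timeout=timeout)


if __name__ == "__main__":
    base = ["docker", "compose", "--project-name", "example_project"]
    # Only prints the argv; call run_compose_action() to actually execute it.
    print(build_compose_command(base, "start", ["zoo1", "zoo2", "zoo3"]))
```

Keeping argv construction separate from execution means the command shape can be checked without a Docker daemon, which may help when a regression test for this race is wanted.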