Fix flaky test_keeper_session_refuse_stale_server by groeneai · Pull Request #109037 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix flaky test_keeper_session_refuse_stale_server#109037

Merged
alexey-milovidov merged 1 commit into
ClickHouse:masterfrom
groeneai:fix-flaky-test_keeper_session_refuse_stale_server
Jul 1, 2026
Merged

Fix flaky test_keeper_session_refuse_stale_server#109037
alexey-milovidov merged 1 commit into
ClickHouse:masterfrom
groeneai:fix-flaky-test_keeper_session_refuse_stale_server

Conversation

@groeneai

@groeneai groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Fixes flaky test_keeper_session_refuse_stale_server (requested by @alexey-milovidov on #108074).

The test restarts all three Keeper containers, then waits for them to come back. stop_zookeeper_nodes() stops them serially, but start_zookeeper_nodes() went through process_integration_nodes(), which ran one docker compose start <node> process per node concurrently (a ThreadPoolExecutor) against the same compose project.

Concurrent docker compose invocations on one project race on the project's shared state: on a loaded CI runner (integration tests run with -n workers) the start of one node can be silently dropped.

On the failing amd_msan run the Keeper's graceful stop overran docker's stop timeout and was SIGKILLed (exit 137); the three concurrent docker compose start processes then fired within ~2 ms of that stop returning, and dockerd received no /start for zoo3 (it did for zoo1/zoo2). zoo3 never came back, so wait_zookeeper_to_start() timed out after 180 s and raised the generic Cannot wait ZooKeeper container ... iptables-nft exception at test.py:73. The same latent race affected kill_zookeeper_nodes().

Fix: use a single docker compose <action> <node1> <node2> ... invocation instead of N concurrent ones. This is the pattern the framework already uses for startup (a single docker compose up -d); compose parallelizes the services internally, so there is no loss of parallelism and no shared-project race.

CI failure this addresses (Integration tests (amd_msan, 5/8), PR #108074, commit 817d237610a2d554f27994aedfdd07de1fef0bdc):
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108074&sha=817d237610a2d554f27994aedfdd07de1fef0bdc&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%205%2F8%29

Evidence from that report: dockerd received zoo3's container /json inspects but no /start while zoo1/zoo2 got /start; docker.log shows zoo3-1 exited with code 137 with no subsequent restart; the gw0 log then polls zoo3: bad hostname for the full 180 s.

No related open issue found (searched by test name, process_integration_nodes, start_zookeeper_nodes).

Version info

  • Merged into: 26.7.1.370

The test restarts all three Keeper containers, then waits for them to come
back. stop_zookeeper_nodes() stops them serially, but start_zookeeper_nodes()
went through process_integration_nodes(), which ran one `docker compose start
<node>` process per node concurrently (ThreadPoolExecutor) against the same
compose project.

Concurrent `docker compose` invocations on one project race on the project's
shared state: on a loaded CI runner (integration tests run with -n workers), the
start of one node can be silently dropped. On the amd_msan failure the Keeper's
graceful stop overran docker's stop timeout and was SIGKILLed (exit 137); the
three concurrent `docker compose start` processes then fired within 2 ms of that
stop returning, and dockerd received no /start for zoo3 (it did for zoo1/zoo2).
zoo3 never came back, so wait_zookeeper_to_start() timed out after 180 s and
raised the generic "Cannot wait ZooKeeper container ... iptables-nft" exception.
The same latent race affected kill_zookeeper_nodes().

Use a single `docker compose <action> <node1> <node2> ...` invocation instead of
N concurrent ones. This is the pattern the framework already uses for startup
(a single `docker compose up -d`); compose parallelizes the services internally,
so there is no loss of parallelism and no shared-project race.

Verified via the CI report for PR ClickHouse#108074 (amd_msan 5/8): dockerd received
zoo3's container /json inspects but no /start, while zoo1/zoo2 got /start;
docker.log shows "zoo3-1 exited with code 137" with no subsequent restart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

cc @maxknv — could you review this? It fixes a concurrent-docker compose-on-one-project race in the integration test framework: process_integration_nodes (used by start_zookeeper_nodes/kill_zookeeper_nodes) ran one docker compose start <node> process per node concurrently against the same project, and on a loaded runner one node's start can be silently dropped (proven from the #108074 amd_msan report: dockerd got zoo3's /json but no /start). Switched to a single atomic docker compose <action> node1 node2 ... invocation.

@PedroTadim PedroTadim added the can be tested Allows running workflows for external contributors label Jul 1, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jul 1, 2026
@alexey-milovidov alexey-milovidov self-assigned this Jul 1, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jul 1, 2026
Merged via the queue into ClickHouse:master with commit c0022c8 Jul 1, 2026
174 checks passed
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants