groeneai · 2026-07-01T09:16:58Z

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Fixes flaky test_keeper_session_refuse_stale_server (requested by @alexey-milovidov on #108074).

The test restarts all three Keeper containers, then waits for them to come back. stop_zookeeper_nodes() stops them serially, but start_zookeeper_nodes() went through process_integration_nodes(), which ran one docker compose start <node> process per node concurrently (a ThreadPoolExecutor) against the same compose project.

Concurrent docker compose invocations on one project race on the project's shared state: on a loaded CI runner (integration tests run with -n workers) the start of one node can be silently dropped.

On the failing amd_msan run the Keeper's graceful stop overran docker's stop timeout and was SIGKILLed (exit 137); the three concurrent docker compose start processes then fired within ~2 ms of that stop returning, and dockerd received no /start for zoo3 (it did for zoo1/zoo2). zoo3 never came back, so wait_zookeeper_to_start() timed out after 180 s and raised the generic Cannot wait ZooKeeper container ... iptables-nft exception at test.py:73. The same latent race affected kill_zookeeper_nodes().

Fix: use a single docker compose <action> <node1> <node2> ... invocation instead of N concurrent ones. This is the pattern the framework already uses for startup (a single docker compose up -d); compose parallelizes the services internally, so there is no loss of parallelism and no shared-project race.
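The difference between the two shapes can be sketched in Python without running Docker at all. This is only an illustration of the command lines involved, not the framework's actual code: the function names, the `example_project` project name, and the `BASE` prefix are hypothetical, and the real helper builds its command prefix elsewhere.

```python
from typing import Iterable, List


def single_invocation(base_cmd: List[str], action: str, nodes: Iterable[str]) -> List[str]:
    """Build ONE compose command that applies `action` to every node."""
    node_list = list(nodes)  # materialize once, so a generator argument is safe
    if not node_list:
        raise ValueError("at least one node is required")
    return [*base_cmd, action, *node_list]


def per_node_invocations(base_cmd: List[str], action: str, nodes: Iterable[str]) -> List[List[str]]:
    """The pre-fix shape, for contrast: one command per node, run concurrently."""
    return [[*base_cmd, action, node] for node in nodes]


# Hypothetical command prefix; the project name is an assumption for the example.
BASE = ["docker", "compose", "--project-name", "example_project"]
NODES = ["zoo1", "zoo2", "zoo3"]

old_shape = per_node_invocations(BASE, "start", NODES)
new_shape = single_invocation(BASE, "start", NODES)

print(len(old_shape), "separate compose processes before the fix")
print(" ".join(new_shape))
```

With the single command there is only one compose client talking to the project, so the per-node start requests cannot interleave with another client's view of the same project.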

CI failure this addresses (Integration tests (amd_msan, 5/8), PR #108074, commit 817d237610a2d554f27994aedfdd07de1fef0bdc):
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108074&sha=817d237610a2d554f27994aedfdd07de1fef0bdc&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%205%2F8%29

Evidence from that report: dockerd received zoo3's container /json inspects but no /start while zoo1/zoo2 got /start; docker.log shows zoo3-1 exited with code 137 with no subsequent restart; the gw0 log then polls zoo3: bad hostname for the full 180 s.

No related open issue found (searched by test name, process_integration_nodes, start_zookeeper_nodes).

Version info

Merged into: 26.7.1.370

The test restarts all three Keeper containers, then waits for them to come back. stop_zookeeper_nodes() stops them serially, but start_zookeeper_nodes() went through process_integration_nodes(), which ran one `docker compose start <node>` process per node concurrently (ThreadPoolExecutor) against the same compose project.

Concurrent `docker compose` invocations on one project race on the project's shared state: on a loaded CI runner (integration tests run with -n workers), the start of one node can be silently dropped.

On the amd_msan failure the Keeper's graceful stop overran docker's stop timeout and was SIGKILLed (exit 137); the three concurrent `docker compose start` processes then fired within 2 ms of that stop returning, and dockerd received no /start for zoo3 (it did for zoo1/zoo2). zoo3 never came back, so wait_zookeeper_to_start() timed out after 180 s and raised the generic "Cannot wait ZooKeeper container ... iptables-nft" exception. The same latent race affected kill_zookeeper_nodes().

Use a single `docker compose <action> <node1> <node2> ...` invocation instead of N concurrent ones. This is the pattern the framework already uses for startup (a single `docker compose up -d`); compose parallelizes the services internally, so there is no loss of parallelism and no shared-project race.

Verified via the CI report for PR ClickHouse#108074 (amd_msan 5/8): dockerd received zoo3's container /json inspects but no /start, while zoo1/zoo2 got /start; docker.log shows "zoo3-1 exited with code 137" with no subsequent restart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-07-01T09:17:51Z

Pre-PR validation gate

| # | Question | Answer |
|---|----------|--------|
| a | Deterministic repro? | Partial. The full timing race needs CI's loaded multi-worker daemon; on an idle single-tenant host it did not fire in 60+ cycles (including with the exact CI `docker compose` v2.29.7 binary + slow-SIGKILL zoo3 + 8 churning projects). But the mechanism is proven deterministically from the CI report at the dockerd-API level (see f). |
| b | Root cause explained? | Yes. `stop_zookeeper_nodes` stops serially; `start_zookeeper_nodes` -> `process_integration_nodes` fired N concurrent `docker compose start <node>` processes on the SAME compose project. Concurrent compose invocations race on shared project state; on the amd_msan run zoo3's graceful stop overran docker's timeout (SIGKILL, exit 137), the 3 concurrent starts fired ~2 ms after that stop returned, and dockerd got zoo3's /json inspects but NO /start (zoo1/zoo2 got /start). zoo3 never restarted -> `wait_zookeeper_to_start()` timed out after 180 s -> generic "iptables-nft" exception at test.py:73. |
| c | Fix matches root cause? | Yes. Replaces N concurrent single-service invocations with ONE atomic `docker compose <action> node1 node2 ...`, eliminating the concurrent-compose-on-one-project race. Same pattern the framework already uses for startup (single `docker compose up -d`). |
| d | Test intent preserved / new tests added? | Yes. No test assertion changed. The helper still stops/starts/kills exactly the requested nodes; compose parallelizes services internally so there is no loss of parallelism. Fix is in shared framework code exercised by 8 existing keeper-restart tests. |
| e | Both directions demonstrated? | Partial. Could not force the low-rate race locally (idle host). Confirmed the FIXED atomic path reliably restarts all 3 keepers incl. a SIGKILL'd zoo3 across 30+ cycles under contention; confirmed in isolation that `docker compose start` on an exit-137 container works, so the failure is the concurrency, not the exit state. The CI report is the primary evidence (f). |
| f | Fix is general across code paths? | Yes. `process_integration_nodes` is the sole caller path for BOTH `start_zookeeper_nodes` and `kill_zookeeper_nodes` (same latent race) and the only user of `concurrent` in the integration tree, now removed. Report-backed: dockerd received zoo3's /json inspects but no /start while zoo1/zoo2 got /start; `docker.log` shows `zoo3-1 exited with code 137` with no restart. |
| g | Fix generalizes across inputs? | N/A (test-infrastructure concurrency fix, not a code-path/datatype bug). Applies uniformly to any node list and any action (start/kill). |
| h | Backward compatible? | N/A (test-only helper; no server behavior, setting, or on-disk/wire format changed). |
| i | Invariants and contracts preserved? | Yes. `process_integration_nodes` contract (perform `action` on all `nodes`, raise on failure) is preserved: a single `subprocess_check_call` still raises on non-zero exit. `list(nodes)` guards against a generator argument. |
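The contract described in row i (act on all nodes, raise on failure, tolerate a generator argument) can be sketched as follows. This is a minimal stand-in under stated assumptions, not the merged code: `process_nodes_sketch` and its `run` parameter are hypothetical, and the standard library's `subprocess.check_call` stands in for the framework's own `subprocess_check_call` helper.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def process_nodes_sketch(
    action: str,
    nodes: Iterable[str],
    base_cmd: List[str],
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> List[str]:
    """Apply `action` to all nodes with one command; raise if that command fails."""
    node_list = list(nodes)  # a generator can only be consumed once, so copy it first
    if not node_list:
        return node_list
    run([*base_cmd, action, *node_list])  # check_call raises on a non-zero exit
    return node_list


# A generator argument still yields one complete command (recorded, not executed).
recorded: List[List[str]] = []
handled = process_nodes_sketch(
    "kill", (f"zoo{i}" for i in (1, 2, 3)), ["docker", "compose"], run=recorded.append
)

# A failing command still surfaces as an exception. The Python interpreter is
# used here as a stand-in for a compose process that exits non-zero.
failing_cmd = [sys.executable, "-c", "raise SystemExit(3)"]
try:
    process_nodes_sketch("start", ["zoo1"], failing_cmd)
    raised = False
except subprocess.CalledProcessError as exc:
    raised = exc.returncode == 3
```

Injecting `run` keeps the sketch testable without Docker; in the real helper the command runner is fixed.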

Session id: cron:clickhouse-worker-slot-1:20260701-080800

groeneai · 2026-07-01T09:17:59Z

cc @maxknv, could you review this? It fixes a race between concurrent docker compose invocations on one project in the integration test framework: process_integration_nodes (used by start_zookeeper_nodes/kill_zookeeper_nodes) ran one docker compose start <node> process per node concurrently against the same project, and on a loaded runner one node's start can be silently dropped (proven from the #108074 amd_msan report: dockerd got zoo3's /json but no /start). Switched to a single atomic docker compose <action> node1 node2 ... invocation.

clickhouse-gh · 2026-07-01T09:36:52Z

groeneai mentioned this pull request Jul 1, 2026

Use libdeflate for gzip/zlib/deflate compression and decompression #108074

Open

PedroTadim added the can be tested Allows running workflows for external contributors label Jul 1, 2026

clickhouse-gh Bot added the pr-ci label Jul 1, 2026

alexey-milovidov approved these changes Jul 1, 2026

alexey-milovidov self-assigned this Jul 1, 2026

alexey-milovidov added this pull request to the merge queue Jul 1, 2026

Merged via the queue into ClickHouse:master with commit c0022c8 Jul 1, 2026
174 checks passed

robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 1, 2026

groeneai mentioned this pull request Jul 2, 2026

Guard ColumnString::filter against offsets inconsistent with chars array #109184

Open

Fix flaky test_keeper_session_refuse_stale_server #109037

alexey-milovidov merged 1 commit into ClickHouse:master from groeneai:fix-flaky-test_keeper_session_refuse_stale_server

4 participants