Fix flaky test_startup_without_zookeeper: retry the recursive ZooKeeper delete by alexey-milovidov · Pull Request #108791 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix flaky test_startup_without_zookeeper: retry the recursive ZooKeeper delete#108791

Merged
Algunenano merged 1 commit into
masterfrom
fix-flaky-test-replication-without-zookeeper
Jun 29, 2026
Merged

Fix flaky test_startup_without_zookeeper: retry the recursive ZooKeeper delete#108791
Algunenano merged 1 commit into
masterfrom
fix-flaky-test-replication-without-zookeeper

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 29, 2026

Copy link
Copy Markdown
Member

Fixes a flaky integration test.

test_replication_without_zookeeper/test.py::test_startup_without_zookeeper intermittently fails with kazoo.exceptions.NotEmptyError while executing zk.delete(path="/clickhouse", recursive=True) in drop_zk.

The recursive delete runs while ClickHouse (node1) is still alive, so its ReplicatedMergeTree background threads keep re-creating ZooKeeper nodes under /clickhouse. kazoo's recursive delete races with that: between get_children and delete a new child node can appear, and kazoo raises NotEmptyError without retrying.

The test already called cluster.run_kazoo_commands_with_retries, but with the default repeats=1. In that helper the retry loop is for i in range(repeats - 1), so repeats=1 performs zero retries — the callback runs exactly once with no exception handling. Pass repeats=5 (matching the existing usages in helpers/cluster.py) so the drop is retried on the transient NotEmptyError.

This is a pre-existing flaky test, not a regression — the same NotEmptyError in drop_zk has been observed on unrelated pull requests.

Related: #103540
Related: #101757

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=d40ea5d0da4103b54971ac01a9960ef9153242bb&name_0=MasterCI&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20db%20disk%2C%20old%20analyzer%2C%203%2F6%29

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Not required.

🤖 Generated with Claude Code

Version info

  • Merged into: 26.7.1.240

`test_replication_without_zookeeper/test.py::test_startup_without_zookeeper`
intermittently fails with `kazoo.exceptions.NotEmptyError` inside `drop_zk`,
which calls `zk.delete(path="/clickhouse", recursive=True)`.

The drop happens while `node1` (ClickHouse) is still running, so its
`ReplicatedMergeTree` background threads keep re-creating ZooKeeper nodes under
`/clickhouse`. kazoo's recursive delete races with that: between `get_children`
and `delete` a new child node can appear, and kazoo raises `NotEmptyError`
without retrying.

The test already used `cluster.run_kazoo_commands_with_retries`, but with the
default `repeats=1`. In that helper the retry loop is `for i in range(repeats - 1)`,
so `repeats=1` performs zero retries — the callback runs exactly once with no
exception handling. Pass `repeats=5` (matching the existing usages in
`helpers/cluster.py`) so the drop is retried on the transient `NotEmptyError`.

This is a pre-existing flaky test (same failure seen on unrelated PRs, e.g.
#103540 and #101757), not a regression.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=d40ea5d0da4103b54971ac01a9960ef9153242bb&name_0=MasterCI&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20db%20disk%2C%20old%20analyzer%2C%203%2F6%29

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 29, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.50% +0.10%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.70% +0.10%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 1022 line(s), 97 function(s) across 155 file(s) · Details

Top files
  • programs/keeper/Keeper.cpp: 259 line(s), 25 function(s)
  • src/Server/KeeperHTTPHandlerFactory.cpp: 100 line(s), 18 function(s)
  • src/Storages/ObjectStorage/DataLakes/Iceberg/ComplexTypeSchemaProcessorFunctions.cpp: 83 line(s), 8 function(s)
  • src/Interpreters/InterpreterSystemQuery.cpp: 52 line(s)
  • src/Functions/FunctionShowCertificate.cpp: 38 line(s), 4 function(s)

Full report

@Algunenano Algunenano self-assigned this Jun 29, 2026
@Algunenano Algunenano added this pull request to the merge queue Jun 29, 2026
Merged via the queue into master with commit a276240 Jun 29, 2026
174 checks passed
@Algunenano Algunenano deleted the fix-flaky-test-replication-without-zookeeper branch June 29, 2026 11:45
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants