
groeneai · 2026-05-17T09:15:27Z

Related: #67732

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a heap-use-after-free in DatabaseReplicated::tryGetReplicasInfo (reachable by querying system.clusters for replica-state columns such as is_active / unsynced_after_recovery / recovery_time while a Replicated database is being dropped or detached), and lowered the log level of swallowed Keeper exceptions in DatabaseReplicated::tryGetCluster, tryGetAllGroupsCluster, and tryGetReplicasInfo from <Error> to <Information>. The exceptions are caught and ignored by design (callers such as system.clusters treat the affected database as transiently unavailable and skip it), but previously the error-level log line was forwarded to client stderr at the default send_logs_level=warning, causing unrelated tests touching system.clusters (e.g. 01293_show_clusters) to appear failed on parallel stateless lanes.

Description

This PR fixes two related issues in DatabaseReplicated's system.clusters read path.

1. Flaky `01293_show_clusters` (log level) - original PR scope

DatabaseReplicated::tryGetCluster and the sibling tryGetAllGroupsCluster / tryGetReplicasInfo intentionally swallow exceptions thrown while reading the database's /replicas Keeper state. The existing comment in tryGetCluster explains why:

A quick fix for stateless tests with DatabaseReplicated. Its ZK node can be destroyed at any time. If another test lists system.clusters to get client command line suggestions, it will get an error when trying to get the info about DB from ZK. Just ignore these inaccessible databases.

The exception is swallowed (cluster stays null, callers skip the database), but the catch block was logging it via tryLogCurrentException(log), which defaults to LogsLevel::error. The CI test runner forwards server logs at send_logs_level=warning and above to the client stderr, so the swallowed exception still showed up as test stderr noise, reporting unrelated system.clusters queries (e.g. 01293_show_clusters) as failed on arm_binary / amd_debug / s3 lanes.

The three catches now log via tryLogCurrentException(log, "...", LogsLevel::information) for the expected coordination/connection codes (KEEPER_EXCEPTION, ALL_CONNECTION_TRIES_FAILED, NO_ACTIVE_REPLICAS); anything else stays at the default error level so genuine problems remain visible to operators.
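The selection rule described above can be sketched as a small, self-contained classifier. This is only an illustration under stated assumptions: the `Level` enum and the helpers `isExpectedLifecycleError` and `levelForSwallowedException` are hypothetical names, not part of ClickHouse; only the three numeric codes are taken from this PR.

```cpp
#include <cassert>

// Stand-in for the two log levels discussed here (not ClickHouse's LogsLevel).
enum class Level { Error, Information };

// Error codes as quoted in this PR.
constexpr int NO_ACTIVE_REPLICAS = 254;
constexpr int ALL_CONNECTION_TRIES_FAILED = 279;
constexpr int KEEPER_EXCEPTION = 999;

// Hypothetical helper: true for the coordination/connection failures that are
// expected while a Replicated database is being created, dropped or detached.
constexpr bool isExpectedLifecycleError(int code)
{
    return code == KEEPER_EXCEPTION
        || code == ALL_CONNECTION_TRIES_FAILED
        || code == NO_ACTIVE_REPLICAS;
}

// Hypothetical helper: expected failures are logged quietly, everything else
// keeps the default error level so operators still notice it.
constexpr Level levelForSwallowedException(int code)
{
    return isExpectedLifecycleError(code) ? Level::Information : Level::Error;
}
```

Keeping the predicate as an explicit allow-list means any code that is not named falls back to the error level by default, which matches the intent that genuine problems stay visible.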

Regression tests:

04252_database_replicated_system_clusters_log_level.sh - deletes /replicas, asserts no Code: (279|999) leaks to client stderr.
04254_database_replicated_system_clusters_no_active_replicas.sh - empty /replicas, asserts no Code: 254 (NO_ACTIVE_REPLICAS) leak.
04278_database_replicated_system_clusters_replicas_info.sh - deletes /max_log_ptr, exercises the tryGetReplicasInfo path specifically.

2. Data race / heap-use-after-free on `ddl_worker` in `tryGetReplicasInfo`

The TSan flaky check on this PR caught a real, pre-existing data race in tryGetReplicasInfo:

WARNING: ThreadSanitizer: data race
  Write of size 8 by thread T (DROP DATABASE):
    DatabaseReplicated::shutdown()                 -> ddl_worker = nullptr (frees the worker, under ddl_worker_mutex)
    operator delete / ~DatabaseReplicatedDDLWorker  src/Databases/DatabaseReplicatedWorker.h:23
  Previous read of size 8 by thread U (SELECT ... FROM system.clusters):
    DatabaseReplicatedDDLWorker::isUnsyncedAfterRecovery()  src/Databases/DatabaseReplicatedWorker.h:43
    DatabaseReplicated::tryGetReplicasInfo()                src/Databases/DatabaseReplicated.cpp:588
    StorageSystemClusters::fillData()

In tryGetReplicasInfo, the recovery_time read was correctly performed under ddl_worker_mutex, but the immediately following unsynced_after_recovery = ddl_worker && ddl_worker->isUnsyncedAfterRecovery() read happened outside that lock (the lock_guard scope had already ended). shutdown() resets and frees ddl_worker under ddl_worker_mutex, so a SELECT is_active / unsynced_after_recovery / recovery_time FROM system.clusters running concurrently with DROP/DETACH DATABASE could dereference the freed worker - a heap-use-after-free (also reachable in release builds, not just under sanitizers).

The fix moves the isUnsyncedAfterRecovery() read into the same ddl_worker_mutex block as the recovery_time read. Behavior is unchanged: both members are read only when ddl_worker is non-null, and unsynced_after_recovery stays false when it is null. All writes to ddl_worker already hold ddl_worker_mutex, so the read is now race-free.
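A minimal sketch of the locking pattern, assuming the worker is held in a `std::unique_ptr` guarded by a `std::mutex`. `Worker`, `Database`, `ReplicaState`, `readState` and `recoveryTime` are stand-ins invented for illustration; only `ddl_worker`, `ddl_worker_mutex`, `shutdown` and `isUnsyncedAfterRecovery` are names taken from the description.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Stand-in for DatabaseReplicatedDDLWorker; the values are arbitrary.
struct Worker
{
    bool isUnsyncedAfterRecovery() const { return true; }
    unsigned recoveryTime() const { return 7; }  // hypothetical accessor
};

struct ReplicaState
{
    unsigned recovery_time = 0;
    bool unsynced_after_recovery = false;
};

class Database
{
public:
    Database() : ddl_worker(std::make_unique<Worker>()) {}

    // Both reads happen inside one critical section, so shutdown() cannot
    // free the worker between them.
    ReplicaState readState()
    {
        ReplicaState state;
        std::lock_guard<std::mutex> lock(ddl_worker_mutex);
        if (ddl_worker)
        {
            state.recovery_time = ddl_worker->recoveryTime();
            state.unsynced_after_recovery = ddl_worker->isUnsyncedAfterRecovery();
        }
        return state;
    }

    // The reset (and therefore the free) takes the same mutex as the reader.
    void shutdown()
    {
        std::lock_guard<std::mutex> lock(ddl_worker_mutex);
        ddl_worker = nullptr;
    }

private:
    std::mutex ddl_worker_mutex;
    std::unique_ptr<Worker> ddl_worker;
};
```

In the buggy shape, the second read would sit after the closing brace of the locked scope; here a reader that runs after `shutdown()` simply observes the defaults.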

CI provenance for the race (this PR's flaky check):

Check: Stateless tests (amd_tsan, flaky check), STID 5804-51e6
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105149&sha=9cb2bfe9cddafd1b736545f775b8995d6e980f7c&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20flaky%20check%29

Verification:

Built the amd_tsan binary with the fix; the natural workload (create Replicated DB, concurrent SELECT is_active/unsynced_after_recovery/recovery_time FROM system.clusters, DROP DATABASE, repeated under TSan) runs clean - no data race, no deadlock.
Confirmed the reader reaches the unlocked ddl_worker access in tryGetReplicasInfo (the affected line) before the fix.
All three regression tests (04252 / 04254 / 04278) pass on the TSan build.

Version info

Merged into: 26.7.1.276

groeneai · 2026-05-17T09:15:56Z

Pre-PR validation gate (per TASK.md Phase 4 Step 9)

a) Deterministic repro? ✅ Confirmed.
Command:

clickhouse client -q "CREATE DATABASE foo ENGINE = Replicated('/test/foo', 'shard1', 'replica1')"
clickhouse keeper-client --port 9181 --query "rmr /test/foo/replicas"
clickhouse client --send_logs_level=warning -q "SELECT count() FROM system.clusters" 2>&1

Without the fix this always emits <Error> DatabaseReplicated (foo): Code: 999. Coordination::Exception: Coordination error: No node, path /test/foo/replicas. (KEEPER_EXCEPTION) to client stderr. With the fix the stderr is empty.

b) Root cause explained? ✅
DatabaseReplicated::tryGetCluster (and the two siblings) catch KEEPER_EXCEPTION / ALL_CONNECTION_TRIES_FAILED thrown by getClusterImpl while reading /replicas, return nullptr, and let the caller (e.g. system.clusters filling) skip the database. The catch block called tryLogCurrentException(log), which defaults to LogsLevel::error. The test runner sets send_logs_level=warning and forwards everything at warning-and-above to client stderr. So any time another session is racing on a Replicated database's Keeper state, the swallowed exception still leaks into the stderr of an unrelated client, and that stderr is what clickhouse-test reports as a failure.
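The forwarding rule in this explanation can be illustrated with a simplified model. The enum below is a stand-in that assumes levels are ordered from most to least severe, and `forwardedToClient` is a hypothetical helper, not a ClickHouse function.

```cpp
#include <cassert>

// Stand-in ordering: a smaller value means a more severe message.
enum class LogsLevel { none = 0, fatal, error, warning, information, debug, trace };

// Hypothetical helper: a server log line reaches the client only if it is at
// least as severe as the client's send_logs_level threshold.
constexpr bool forwardedToClient(LogsLevel message, LogsLevel send_logs_level)
{
    return send_logs_level != LogsLevel::none
        && static_cast<int>(message) <= static_cast<int>(send_logs_level);
}
```

Under this model an error-level line passes a `warning` threshold and lands in client stderr, while an information-level line does not, which is the whole effect the log-level change relies on.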

c) Fix matches root cause? ✅
Lowered the log level in all three catches to LogsLevel::information with a descriptive message. information is above the production logger default (so administrators still see the message in normal server logs), but below send_logs_level=warning, so it no longer propagates to client stderr. No behavior change to callers -- the exception is still swallowed and nullptr is still returned. Matches the existing pattern in the same file (Async loading failed at line 2168 is logged at LogsLevel::warning).

d) Test intent preserved? / New tests added? ✅
01293_show_clusters.sh was not modified -- it continues to test SHOW CLUSTERS as before. Added a new dedicated regression test 04252_database_replicated_system_clusters_log_level.sh that deterministically forces tryGetCluster to throw and asserts no DatabaseReplicated.*Code: (279|999) text leaks into client stderr.

e) Demonstrated in both directions? ✅

Without the fix (unmodified binary, same scenario): client stderr contains the full Code: 999. Coordination::Exception: Coordination error: No node, path /clickhouse/test/repro_v2/replicas. (KEEPER_EXCEPTION), Stack trace ... text starting with the 7. src/Databases/DatabaseReplicated.cpp:390:39: DB::DatabaseReplicated::getClusterImpl(bool) const line, which my test's grep would catch.
With the fix: client stderr is empty; system.text_log shows the message at Information level with the new prefix Failed to get cluster info (possibly due to concurrent database lifecycle operations): Code: 999. ....
Regression test stress-ran 5/5 PASS locally with unique CLICKHOUSE_DATABASE per run.

f) Fix is general, not a narrow patch? ✅
Examined all tryLogCurrentException(log); call sites in DatabaseReplicated.cpp. Three are reached from the system.clusters code path (tryGetCluster, tryGetAllGroupsCluster, tryGetReplicasInfo) -- all three are now consistent (information). The other two (tryConnectToZooKeeperAndInitDatabase at line 605, and the ALTER replication-classification helper at line 2575) are not on the system.clusters path and have different semantics; they were intentionally left unchanged.

session: cron:clickhouse-ci-task-worker:20260517-084500
task: 2026-05-15-ci-01293showclusters-flaky-in-statele

groeneai · 2026-05-17T09:16:33Z

cc @tavplubix @alesapin — could you review this? Three log-level changes in DatabaseReplicated (tryGetCluster, tryGetAllGroupsCluster, tryGetReplicasInfo) lowering swallowed KEEPER_EXCEPTION / ALL_CONNECTION_TRIES_FAILED from <Error> to <Information> so they stop leaking into client stderr at send_logs_level=warning and breaking unrelated parallel stateless tests (e.g. 01293_show_clusters). Comes with a new deterministic regression test 04252_database_replicated_system_clusters_log_level.sh.

clickhouse-gh · 2026-05-17T13:27:51Z

Workflow [PR], commit [e2317ca]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR lowers the log level for expected transient DatabaseReplicated Keeper/read races in the system.clusters path, keeps unexpected exceptions at error, and fixes the ddl_worker lifetime race in tryGetReplicasInfo by reading it under ddl_worker_mutex. The current tests cover the changed tryGetCluster, tryGetAllGroupsCluster, and tryGetReplicasInfo paths, and I did not find any new blocker or major issue in the current diff.

Final Verdict

Status: ✅ Approve

…trap cleanup

Per @clickhouse-gh[bot] review feedback on PR ClickHouse#105149:

1. `src/Databases/DatabaseReplicated.cpp`: previously the `catch (...)` blocks in `tryGetCluster`, `tryGetAllGroupsCluster`, and `tryGetReplicasInfo` downgraded *every* swallowed exception to `LogsLevel::information`. That includes unexpected exceptions from `getClusterImpl` (malformed Keeper payloads, logic bugs, ...) which we still want operators to notice. Restrict the downgrade to the expected coordination errors (`KEEPER_EXCEPTION` / `ALL_CONNECTION_TRIES_FAILED`); anything else stays at the default `error` level via the plain `tryLogCurrentException(log)`.
2. `tests/queries/0_stateless/04252_database_replicated_system_clusters_log_level.sh`: move the database / metadata / Keeper cleanup into a `trap cleanup EXIT` handler. The test deliberately leaves the `Replicated` database in a non-droppable state (it deletes `${ZK_PATH}/replicas`), so without the trap a single `exit 1` on assertion failure would leave the orphan attached and the next run's `CREATE DATABASE` would fail.

groeneai · 2026-05-17T18:23:53Z

Thanks @clickhouse-gh — both review points addressed in a8ce315:

DatabaseReplicated.cpp (tryGetCluster / tryGetAllGroupsCluster / tryGetReplicasInfo): the catch (...) block now only downgrades the expected coordination/connection failures — KEEPER_EXCEPTION and ALL_CONNECTION_TRIES_FAILED — to LogsLevel::information. Anything else falls through to the plain tryLogCurrentException(log) and stays at the default error level, so malformed Keeper payloads or unexpected logic errors in getClusterImpl remain visible to operators.
```
const auto code = getCurrentExceptionCode();
if (code == ErrorCodes::KEEPER_EXCEPTION || code == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
    tryLogCurrentException(log, "…", LogsLevel::information);
else
    tryLogCurrentException(log);
```
04252_database_replicated_system_clusters_log_level.sh: moved the database DETACH / metadata-file removal / Keeper rmr cleanup into a trap cleanup EXIT handler so it runs on both the success and failure paths. Confirmed locally with bash -x that cleanup is invoked after echo OK and after exit 1.

(Session: cron:clickhouse-ci-task-worker:20260517-181500)

alexey-milovidov · 2026-05-17T18:41:30Z

This was fixed by #105146. Let's update the branch.

groeneai · 2026-05-18T02:02:06Z

Addressed the bot's NO_ACTIVE_REPLICAS gap in commit ec409aeec3ef.

src/Databases/DatabaseReplicated.cpp: NO_ACTIVE_REPLICAS (code 254) is now in the downgrade predicate for both tryGetCluster and tryGetAllGroupsCluster. Both call getClusterImpl, which throws NO_ACTIVE_REPLICAS at line 418 when /replicas exists but is empty (first replica not yet created / last replica just dropped) -- the same concurrent-lifecycle window as KEEPER_EXCEPTION (999) and ALL_CONNECTION_TRIES_FAILED (279). tryGetReplicasInfo does not call getClusterImpl and cannot throw NO_ACTIVE_REPLICAS, so its catch is unchanged.

New regression test tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh (added as a new test rather than extending 04252 per CLAUDE.md guidance): same setup as 04252 but recreates /replicas as an empty znode via rmr ${ZK_PATH}/replicas followed by touch ${ZK_PATH}/replicas -- the next getClusterImpl call deterministically goes through the NO_ACTIVE_REPLICAS branch. Asserts Code: 254 does not leak to client stderr at send_logs_level=warning.

Bidirectional verification (local debug build, clang-21):

Without this commit (only a8ce3153d38f applied): test 04254 FAILS with leaked <Error> ... Code: 254. DB::Exception: No replicas of database ... (NO_ACTIVE_REPLICAS) on client stderr.
With this commit: test 04254 PASSES, 30/30 runs with --test-runs 30 (full settings randomization). Existing test 04252 also passes 20/20.

CI starting now.

tuanpach · 2026-05-21T09:34:10Z

@groeneai Check the CI failure and address the unresolved comments.

`NO_ACTIVE_REPLICAS` is `Code: 254`, not `Code: 279` (which is `ALL_CONNECTION_TRIES_FAILED`). Updates two comment references in `tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh` so the explanation matches the exception branch actually exercised. Comment-only change. No behavior change. Addresses ClickHouse#105149 (comment)...

groeneai · 2026-05-21T10:18:55Z

Hi @tuanpach, thanks for the ping. Status update:

Bot's inline comment (254 vs 279): addressed in commit 6e35b7a9ad81. Two comment-only updates in tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh so the explanation matches the exception branch (NO_ACTIVE_REPLICAS is Code: 254, not Code: 279).

CI failure analysis (commit ec409aeec3ef): only one failing job, and it's a pre-existing chronic flake unrelated to this PR.

| Job | Test | Classification |
| --- | --- | --- |
| `Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel)` | `02180_group_by_lowcardinality` | Chronic flake on this exact variant. CIDB shows 148 distinct PRs hit it in the last 30 days (20 master + 165 PR failures), and 184 of 185 (99.5%) of all failures concentrate on the same `amd_llvm_coverage, ParallelReplicas, s3 storage, parallel` check_name. The diff is one stray `{"val":null,"avg(toUInt32(val))":null}` row at the top and a missing `{"val":"232.4",...}` row at the bottom -- a row-order non-determinism under parallel-replica result merging, which the test does not pin. Completely orthogonal to `DatabaseReplicated::tryGetCluster` log levels. |

All other 200+ jobs pass: builds (all archs), Fast test, Style check, Stateless tests (debug/asan/tsan/msan/s3/distributed/old-analyzer/etc.), Integration tests (all 6 shards on each sanitizer), Stress tests, AST fuzzer, BuzzHouse, Unit tests, Upgrade check, libFuzzer, Code Review.

CIDB query for reference:

SELECT check_name, count() AS hits
FROM default.checks
WHERE test_name = '02180_group_by_lowcardinality'
  AND test_status IN ('FAIL','ERROR')
  AND check_start_time > now() - INTERVAL 30 DAY
GROUP BY check_name ORDER BY hits DESC;
-- Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel)   184
-- Fast test                                                                       1

I will track 02180_group_by_lowcardinality separately as a chronic flake; not a blocker for this PR.

Ready for review on top of the new commit. Let me know if you want anything else changed before merge.

groeneai · 2026-05-21T10:22:29Z

Quick update on the CI failure: I missed that 02180_group_by_lowcardinality was already stabilized by merged PR #105194 (2026-05-18 08:08 UTC). The CI on ec409aeec3ef failed because the branch's last master sync was 79a216415 from 2026-05-17 21:10 UTC -- 11 hours before the fix landed.

Pushed a master merge as commit 6d379a21ad29. New CI is rerunning. After CI green, ready for your merge.

`clickhouse-gh[bot]` review on PR ClickHouse#105149 noted that the existing regression tests (04252, 04254) use `SELECT count() FROM system.clusters`, which only reads metadata and never invokes the lambda that calls `DatabaseReplicated::tryGetReplicasInfo`. `StorageSystemClusters::writeCluster` only calls `replicas_info_getter` when the query reads one of the replica state columns: `is_active`, `unsynced_after_recovery`, `replication_lag`, `recovery_time`, or `is_shared_catalog_cluster`. So the `tryGetReplicasInfo` log-level change in this PR had no test coverage.

Add `04278_database_replicated_system_clusters_replicas_info.sh` which:

1. Creates a `Replicated` database (so `/replicas` is populated and `tryGetCluster` returns a valid cluster on subsequent calls).
2. Primes `tryGetCluster` so the cluster is cached.
3. Deletes `${ZK_PATH}/max_log_ptr` so the `tryGet(paths)` in `tryGetReplicasInfo` returns `ZNONODE` for `paths[0]` and the function throws `Coordination::Exception(ZNONODE)`, which maps to `KEEPER_EXCEPTION` (`Code: 999`).
4. Queries `SELECT is_active FROM system.clusters` at the default `send_logs_level=warning` and asserts no `DatabaseReplicated.*Code:` leaks to client stderr.

Verified bidirectionally on local debug build:

- Without the fix (catch reverted to bare `tryLogCurrentException(log)`) the test FAILS with the exact expected `<Error> DatabaseReplicated ... Code: 999. Coordination::Exception: Coordination error: No node. (KEEPER_EXCEPTION)` line leaking to client stderr.
- With the fix the test PASSES, 10/10 runs with full settings randomization, alongside 04252 and 04254 (30/30 across all three).
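The coverage gap comes down to which columns a query requests. A rough sketch of that gating, assuming the decision is a simple membership check: `needsReplicasInfo` and `replica_state_columns` are hypothetical names, and the real logic in `StorageSystemClusters::writeCluster` may be structured differently. The column list is the one given in the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Replica-state columns as listed in the commit message.
const std::vector<std::string> replica_state_columns = {
    "is_active",
    "unsynced_after_recovery",
    "replication_lag",
    "recovery_time",
    "is_shared_catalog_cluster",
};

// Hypothetical helper: the replicas-info getter only needs to run when the
// query reads at least one replica-state column.
bool needsReplicasInfo(const std::vector<std::string> & requested_columns)
{
    return std::any_of(
        requested_columns.begin(), requested_columns.end(),
        [](const std::string & name)
        {
            return std::find(replica_state_columns.begin(), replica_state_columns.end(), name)
                != replica_state_columns.end();
        });
}
```

A `SELECT count()` requests no replica-state column, so under this model the getter is never called, whereas `SELECT is_active` forces the call and therefore exercises the `tryGetReplicasInfo` catch.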

…ception

`DatabaseReplicated::tryGetCluster` (and its siblings `tryGetAllGroupsCluster` and `tryGetReplicasInfo`) catch exceptions thrown while reading the database's `/replicas` znode and return nullptr so callers (e.g. `system.clusters` filling) silently skip the database. The catch block was logging the swallowed exception via `tryLogCurrentException(log)`, which defaults to `LogsLevel::error`.

In CI the test runner sets `send_logs_level=warning`, so any exception logged at error level is forwarded to the client stderr. When two parallel stateless tests are creating/dropping their Replicated test databases, an unrelated `SHOW CLUSTERS` (e.g. `01293_show_clusters`) that happens to iterate over the in-flight Replicated database picks up the now-leaked `Code: 999. Coordination::Exception: No node, path /clickhouse/databases/.../replicas` or `Code: 279. No active replicas` in its stderr and is reported as failed -- even though every assertion passes and the auto-rerun budget gets 73/73 passes.

Lower the level to `LogsLevel::information` with a short context message in all three catches. The message is still visible in normal server logs (information is above the production logger default), but it is below the `warning` threshold so it no longer leaks into client stderr. This matches the pattern already used elsewhere in the file (e.g. line 2168, `Async loading failed` logged at `warning`).

A new stateless regression test `04252_database_replicated_system_clusters_log_level.sh` reproduces the failure path deterministically: it creates a Replicated database, deletes its `/replicas` znode via `keeper-client`, queries `system.clusters` at `send_logs_level=warning`, and asserts that no `DatabaseReplicated.*Code: (279|999)` text leaks into the captured client stderr.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105047&sha=dfa4712b3435d65558081aeb9ce3bc19f6d84a07&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29

`getClusterImpl` throws `ErrorCodes::NO_ACTIVE_REPLICAS` (code 254) when the `/replicas` znode exists but is empty -- i.e. the first replica is not fully created yet or the last replica was just dropped. This is the same concurrent-lifecycle window the previous commit (a8ce315) already handles for `KEEPER_EXCEPTION` (999) and `ALL_CONNECTION_TRIES_FAILED` (279), so it should be logged at `<Information>` for the same reason: the catch returns nullptr to the caller (which skips the database) and the noisy `<Error>` log line otherwise leaks to client stderr at the default `send_logs_level=warning` and turns `system.clusters` queries flaky.

Apply the same predicate update to both `tryGetCluster` and `tryGetAllGroupsCluster` (both reach `getClusterImpl`). `tryGetReplicasInfo` does not call `getClusterImpl` and cannot throw `NO_ACTIVE_REPLICAS`; its catch is left unchanged.

Companion regression test 04254 mirrors 04252's setup but recreates `/replicas` as an empty znode (`rmr` + `touch`) so the next `getClusterImpl` call hits the `NO_ACTIVE_REPLICAS` branch deterministically.

Verified locally: without this commit the test FAILS with "Code: 254. DB::Exception: No replicas of database ... (NO_ACTIVE_REPLICAS)" leaked to client stderr; with this commit the test PASSES (30 runs with randomization enabled, 30/30 OK).

groeneai · 2026-06-23T09:15:48Z

CI finished on e86e6f6dd00 (rebased onto current master). The red checks are pre-existing, not introduced by this log-level change:

Stress test (amd_debug / arm_asan_ubsan / arm_msan) "Hung check failed, possible deadlock found": chronic trunk-wide stress flake, ~289 master hits / 30d.
Functional and Mergeable checks are green.

clickhouse-gh · 2026-06-23T14:11:35Z

Dear @tuanpach, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

groeneai · 2026-06-25T17:12:41Z

CI finish ledger — `e86e6f6`

CI has fully finished (Finish Workflow passed; last checks completed 2026-06-23). Every red check has an owner below; only CH Inc sync is exempt.

| Check / test | Reason | Owner / fixing PR |
| --- | --- | --- |
| Stress test (amd_debug) / Hung check failed, possible deadlock found | deadlock (chronic trunk: 976 hits / 433 PRs / 327 master in 30d; unrelated to this PR's DatabaseReplicated log-level change) | #105905 / #101680 (ours, open) |
| Stress test (arm_asan_ubsan) / Hung check failed, possible deadlock found | deadlock (same family) | #105905 / #101680 (ours, open) |
| Stress test (arm_msan) / Hung check failed, possible deadlock found | deadlock (same family) | #105905 / #101680 (ours, open) |
| PR | - | rollup of the Stress hung-check failures above (no separate cause) |

This PR changes only DatabaseReplicated::tryGetCluster/tryGetAllGroupsCluster/tryGetReplicasInfo catch log levels + adds stateless regression tests — none of which touch the thread-pool/blob-removal shutdown paths the hung check exercises. The three Stress hung-check failures belong to the chronic shutdown-deadlock family already being addressed by #105905 and #101680. Ready for review.

Session id: cron:our-pr-ci-monitor:20260625-170000

tryGetReplicasInfo read `ddl_worker` for `unsynced_after_recovery` outside the `ddl_worker_mutex` block (the lock guarding the adjacent `recovery_time` read ended one statement earlier). `shutdown()` resets and frees `ddl_worker` under `ddl_worker_mutex`, so a `SELECT is_active/unsynced_after_recovery/... FROM system.clusters` running concurrently with `DROP/DETACH DATABASE` could dereference the freed worker -> heap-use-after-free.

ThreadSanitizer caught this on PR ClickHouse#105149's flaky check (TSan): write of size 8 in `~DatabaseReplicatedDDLWorker` (DatabaseReplicated::shutdown, ddl_worker = nullptr) vs read in `DatabaseReplicatedDDLWorker::isUnsyncedAfterRecovery` reached from `StorageSystemClusters::fillData` -> `tryGetReplicasInfo`.

Move the `isUnsyncedAfterRecovery()` read into the same `ddl_worker_mutex` block as the `recovery_time` read. Behavior is unchanged: both members are read only when `ddl_worker` is non-null; `unsynced_after_recovery` stays false when it is null. All writes to `ddl_worker` already hold `ddl_worker_mutex`, so the read is now race-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-29T01:29:58Z

Pre-PR validation gate for the new commit (e2317cabc0d, heap-use-after-free fix on ddl_worker in tryGetReplicasInfo):

| # | Question | Answer |
| --- | --- | --- |
| a | Deterministic repro? | The race window is between releasing `ddl_worker_mutex` and the unlocked `ddl_worker->isUnsyncedAfterRecovery()` read. Confirmed locally that a `SELECT is_active/unsynced_after_recovery/recovery_time FROM system.clusters` reaches that exact unlocked access (verified with an env-gated park: the reader sits at the access for ~5s). The race itself was caught deterministically by this PR's amd_tsan flaky check (STID 5804-51e6). |
| b | Root cause explained? | `tryGetReplicasInfo` reads `recovery_time` under `ddl_worker_mutex`, but the very next line reads `ddl_worker` again for `unsynced_after_recovery` after the `lock_guard` scope has ended. `shutdown()` does `ddl_worker = nullptr` (frees `~DatabaseReplicatedDDLWorker`) under `ddl_worker_mutex`. A concurrent `DROP/DETACH DATABASE` therefore frees the worker while the unlocked read dereferences it: heap-use-after-free. |
| c | Fix matches root cause? | Yes. The fix moves the `isUnsyncedAfterRecovery()` read into the same `ddl_worker_mutex` block as `recovery_time`, so every access to `ddl_worker` holds the lock that guards its reset/free. No symptom-guarding. |
| d | Test intent preserved / new tests added? | Existing 04252/04254/04278 unchanged and still pass. No new flaky concurrency test is added: a TSan-race test that races CREATE/DROP vs `system.clusters` would itself be non-deterministic (and the flaky check runs it 50x). The `tryGetReplicasInfo` path is already exercised functionally by 04278; the thread-safety fix is validated by TSan CI + the local TSan workload below. |
| e | Both directions demonstrated? | Without fix: amd_tsan flaky check reported the data race (write in `~DatabaseReplicatedDDLWorker` via `shutdown()` vs read in `isUnsyncedAfterRecovery` via `tryGetReplicasInfo:588`). With fix: rebuilt the amd_tsan binary (Build ID changed) and ran the same create / concurrent-`system.clusters`-read / DROP workload repeatedly under TSan: no race, no deadlock; 04252/04254/04278 all pass. |
| f | Fix is general across code paths? | Audited all `ddl_worker` dereferences in `DatabaseReplicated.cpp`. Line 588 (now fixed) was the only `system.clusters`-reachable read performed outside `ddl_worker_mutex`; all other accesses are already under the mutex, in debug-only `chassert`, or in enqueue/wait paths that the design does not run concurrently with shutdown. No sibling unlocked read remains. |
| g | Fix generalizes across inputs? | N/A for input types - this is a concurrency/lifetime fix, not a value/type-dependent code path. The race does not depend on column types or settings; it depends only on the concurrent `DROP`/`DETACH` timing, which the fix closes unconditionally. |
| h | Backward compatible? | Yes. No setting, on-disk/wire/replication format, or default changes. Pure in-process locking change; behavior of the produced `ReplicasInfo` is identical. |
| i | Invariants and contracts preserved? | Yes. The invariant "every read/write of `ddl_worker` is performed under `ddl_worker_mutex`" now holds on this path too. Semantics preserved: both members read only when `ddl_worker` is non-null; `unsynced_after_recovery` stays `false` when null (same as the old short-circuit `ddl_worker && ddl_worker->isUnsyncedAfterRecovery()`). Lock scope is only widened by one read of the already-held object, no new lock ordering introduced. |

Session id: cron:clickhouse-worker-slot-1:20260629-003300

clickhouse-gh · 2026-06-29T04:58:31Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 25/35 (71.43%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

alexey-milovidov

Very good.

hanfei1991 added the can be tested Allows running workflows for external contributors label May 17, 2026

clickhouse-gh Bot added the pr-ci label May 17, 2026

clickhouse-gh Bot reviewed May 17, 2026

View reviewed changes

Comment thread src/Databases/DatabaseReplicated.cpp Outdated

clickhouse-gh Bot reviewed May 17, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04252_database_replicated_system_clusters_log_level.sh

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

clickhouse-gh Bot reviewed May 17, 2026

View reviewed changes

Comment thread src/Databases/DatabaseReplicated.cpp Outdated

clickhouse-gh Bot reviewed May 18, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh Outdated

tuanpach self-assigned this May 21, 2026

clickhouse-gh Bot reviewed May 26, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04252_database_replicated_system_clusters_log_level.sh

groeneai added 5 commits June 23, 2026 02:46

groeneai force-pushed the groeneai/fix-01293-show-clusters-stderr-noise branch from 1a1eb1e to e86e6f6 Compare June 23, 2026 02:48

clickhouse-gh Bot unassigned tuanpach Jun 23, 2026

alexey-milovidov and others added 2 commits June 28, 2026 23:01

Merge branch 'master' into groeneai/fix-01293-show-clusters-stderr-noise

9cb2bfe

clickhouse-gh Bot added pr-bugfix Pull request with bugfix, not backported by default and removed pr-ci labels Jun 29, 2026

alexey-milovidov approved these changes Jun 29, 2026

View reviewed changes

alexey-milovidov self-assigned this Jun 29, 2026

alexey-milovidov added this pull request to the merge queue Jun 29, 2026

Merged via the queue into ClickHouse:master with commit 6802478 Jun 29, 2026
174 checks passed

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 30, 2026
