Fix flaky 01293_show_clusters: lower log level in DatabaseReplicated::tryGetCluster by groeneai · Pull Request #105149 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix flaky 01293_show_clusters: lower log level in DatabaseReplicated::tryGetCluster#105149

Merged
alexey-milovidov merged 7 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-01293-show-clusters-stderr-noise
Jun 29, 2026
Merged

Fix flaky 01293_show_clusters: lower log level in DatabaseReplicated::tryGetCluster#105149
alexey-milovidov merged 7 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-01293-show-clusters-stderr-noise

Conversation

@groeneai

@groeneai groeneai commented May 17, 2026

Copy link
Copy Markdown
Contributor

Related: #67732

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a heap-use-after-free in DatabaseReplicated::tryGetReplicasInfo (reachable by querying system.clusters for replica-state columns such as is_active / unsynced_after_recovery / recovery_time while a Replicated database is being dropped or detached), and lowered the log level of swallowed Keeper exceptions in DatabaseReplicated::tryGetCluster, tryGetAllGroupsCluster, and tryGetReplicasInfo from <Error> to <Information>. The exceptions are caught and ignored by design (callers such as system.clusters treat the affected database as transiently unavailable and skip it), but previously the error-level log line was forwarded to client stderr at the default send_logs_level=warning, causing unrelated tests touching system.clusters (e.g. 01293_show_clusters) to appear failed on parallel stateless lanes.

Description

This PR fixes two related issues in DatabaseReplicated's system.clusters read path.

1. Flaky 01293_show_clusters (log level) - original PR scope

DatabaseReplicated::tryGetCluster and the sibling tryGetAllGroupsCluster / tryGetReplicasInfo intentionally swallow exceptions thrown while reading the database's /replicas Keeper state. The existing comment in tryGetCluster explains why:

A quick fix for stateless tests with DatabaseReplicated. Its ZK node can be destroyed at any time. If another test lists system.clusters to get client command line suggestions, it will get an error when trying to get the info about DB from ZK. Just ignore these inaccessible databases.

The exception is swallowed (cluster stays null, callers skip the database), but the catch block was logging it via tryLogCurrentException(log), which defaults to LogsLevel::error. The CI test runner forwards server logs at send_logs_level=warning and above to the client stderr, so the swallowed exception still showed up as test stderr noise, reporting unrelated system.clusters queries (e.g. 01293_show_clusters) as failed on arm_binary / amd_debug / s3 lanes.

The three catches now log via tryLogCurrentException(log, "...", LogsLevel::information) for the expected coordination/connection codes (KEEPER_EXCEPTION, ALL_CONNECTION_TRIES_FAILED, NO_ACTIVE_REPLICAS); anything else stays at the default error level so genuine problems remain visible to operators.

Regression tests:

  • 04252_database_replicated_system_clusters_log_level.sh - deletes /replicas, asserts no Code: (279|999) leaks to client stderr.
  • 04254_database_replicated_system_clusters_no_active_replicas.sh - empty /replicas, asserts no Code: 254 (NO_ACTIVE_REPLICAS) leak.
  • 04278_database_replicated_system_clusters_replicas_info.sh - deletes /max_log_ptr, exercises the tryGetReplicasInfo path specifically.

2. Data race / heap-use-after-free on ddl_worker in tryGetReplicasInfo

The TSan flaky check on this PR caught a real, pre-existing data race in tryGetReplicasInfo:

WARNING: ThreadSanitizer: data race
  Write of size 8 by thread T (DROP DATABASE):
    DatabaseReplicated::shutdown()                 -> ddl_worker = nullptr (frees the worker, under ddl_worker_mutex)
    operator delete / ~DatabaseReplicatedDDLWorker  src/Databases/DatabaseReplicatedWorker.h:23
  Previous read of size 8 by thread U (SELECT ... FROM system.clusters):
    DatabaseReplicatedDDLWorker::isUnsyncedAfterRecovery()  src/Databases/DatabaseReplicatedWorker.h:43
    DatabaseReplicated::tryGetReplicasInfo()                src/Databases/DatabaseReplicated.cpp:588
    StorageSystemClusters::fillData()

In tryGetReplicasInfo, the recovery_time read was correctly performed under ddl_worker_mutex, but the immediately following unsynced_after_recovery = ddl_worker && ddl_worker->isUnsyncedAfterRecovery() read happened outside that lock (the lock_guard scope had already ended). shutdown() resets and frees ddl_worker under ddl_worker_mutex, so a SELECT is_active / unsynced_after_recovery / recovery_time FROM system.clusters running concurrently with DROP/DETACH DATABASE could dereference the freed worker - a heap-use-after-free (also reachable in release builds, not just under sanitizers).

The fix moves the isUnsyncedAfterRecovery() read into the same ddl_worker_mutex block as the recovery_time read. Behavior is unchanged: both members are read only when ddl_worker is non-null, and unsynced_after_recovery stays false when it is null. All writes to ddl_worker already hold ddl_worker_mutex, so the read is now race-free.

CI provenance for the race (this PR's flaky check):

Verification:

  • Built the amd_tsan binary with the fix; the natural workload (create Replicated DB, concurrent SELECT is_active/unsynced_after_recovery/recovery_time FROM system.clusters, DROP DATABASE, repeated under TSan) runs clean - no data race, no deadlock.
  • Confirmed the reader reaches the unlocked ddl_worker access in tryGetReplicasInfo (the affected line) before the fix.
  • All three regression tests (04252 / 04254 / 04278) pass on the TSan build.

Related

tests/queries/0_stateless/03206_replication_lag_metric.sh worked around the same stderr noise by setting CLICKHOUSE_CLIENT_SERVER_LOGS_LEVEL=none. That workaround can be revisited in a follow-up now that the underlying log-level mismatch is fixed at the source.

Version info

  • Merged into: 26.7.1.276

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @tavplubix @alesapin — could you review this? Three log-level changes in DatabaseReplicated (tryGetCluster, tryGetAllGroupsCluster, tryGetReplicasInfo) lowering swallowed KEEPER_EXCEPTION / ALL_CONNECTION_TRIES_FAILED from <Error> to <Information> so they stop leaking into client stderr at send_logs_level=warning and breaking unrelated parallel stateless tests (e.g. 01293_show_clusters). Comes with a new deterministic regression test 04252_database_replicated_system_clusters_log_level.sh.

@hanfei1991 hanfei1991 added the can be tested Allows running workflows for external contributors label May 17, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [e2317ca]

Summary:


AI Review

Summary

This PR lowers the log level for expected transient DatabaseReplicated Keeper/read races in the system.clusters path, keeps unexpected exceptions at error, and fixes the ddl_worker lifetime race in tryGetReplicasInfo by reading it under ddl_worker_mutex. The current tests cover the changed tryGetCluster, tryGetAllGroupsCluster, and tryGetReplicasInfo paths, and I did not find any new blocker or major issue in the current diff.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 17, 2026
Comment thread src/Databases/DatabaseReplicated.cpp Outdated
groeneai added a commit to groeneai/ClickHouse that referenced this pull request May 17, 2026
…trap cleanup

Per @clickhouse-gh[bot] review feedback on PR ClickHouse#105149:

1. `src/Databases/DatabaseReplicated.cpp`: previously the `catch (...)` blocks
   in `tryGetCluster`, `tryGetAllGroupsCluster`, and `tryGetReplicasInfo`
   downgraded *every* swallowed exception to `LogsLevel::information`. That
   includes unexpected exceptions from `getClusterImpl` (malformed Keeper
   payloads, logic bugs, ...) which we still want operators to notice.
   Restrict the downgrade to the expected coordination errors
   (`KEEPER_EXCEPTION` / `ALL_CONNECTION_TRIES_FAILED`); anything else
   stays at the default `error` level via the plain `tryLogCurrentException(log)`.

2. `tests/queries/0_stateless/04252_database_replicated_system_clusters_log_level.sh`:
   move the database / metadata / Keeper cleanup into a `trap cleanup EXIT`
   handler. The test deliberately leaves the `Replicated` database in a
   non-droppable state (it deletes `${ZK_PATH}/replicas`), so without the
   trap a single `exit 1` on assertion failure would leave the orphan
   attached and the next run's `CREATE DATABASE` would fail.
@groeneai

Copy link
Copy Markdown
Contributor Author

Thanks @clickhouse-gh — both review points addressed in a8ce315:

  1. DatabaseReplicated.cpp (tryGetCluster / tryGetAllGroupsCluster / tryGetReplicasInfo): the catch (...) block now only downgrades the expected coordination/connection failures — KEEPER_EXCEPTION and ALL_CONNECTION_TRIES_FAILED — to LogsLevel::information. Anything else falls through to the plain tryLogCurrentException(log) and stays at the default error level, so malformed Keeper payloads or unexpected logic errors in getClusterImpl remain visible to operators.

    const auto code = getCurrentExceptionCode();
    if (code == ErrorCodes::KEEPER_EXCEPTION || code == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
        tryLogCurrentException(log, \"\", LogsLevel::information);
    else
        tryLogCurrentException(log);
  2. 04252_database_replicated_system_clusters_log_level.sh: moved the database DETACH / metadata-file removal / Keeper rmr cleanup into a trap cleanup EXIT handler so it runs on both the success and failure paths. Confirmed locally with bash -x that cleanup is invoked after echo OK and after exit 1.

(Session: cron:clickhouse-ci-task-worker:20260517-181500)

@alexey-milovidov

Copy link
Copy Markdown
Member

This was fixed by #105146. Let's update the branch.

Comment thread src/Databases/DatabaseReplicated.cpp Outdated
@groeneai

Copy link
Copy Markdown
Contributor Author

Addressed the bot's NO_ACTIVE_REPLICAS gap in commit ec409aeec3ef.

src/Databases/DatabaseReplicated.cpp: NO_ACTIVE_REPLICAS (code 254) is now in the downgrade predicate for both tryGetCluster and tryGetAllGroupsCluster. Both call getClusterImpl, which throws NO_ACTIVE_REPLICAS at line 418 when /replicas exists but is empty (first replica not yet created / last replica just dropped) -- the same concurrent-lifecycle window as KEEPER_EXCEPTION (999) and ALL_CONNECTION_TRIES_FAILED (279). tryGetReplicasInfo does not call getClusterImpl and cannot throw NO_ACTIVE_REPLICAS, so its catch is unchanged.

New regression test tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh (added as a new test rather than extending 04252 per CLAUDE.md guidance): same setup as 04252 but recreates /replicas as an empty znode via rmr ${ZK_PATH}/replicas followed by touch ${ZK_PATH}/replicas -- the next getClusterImpl call deterministically goes through the NO_ACTIVE_REPLICAS branch. Asserts Code: 254 does not leak to client stderr at send_logs_level=warning.

Bidirectional verification (local debug build, clang-21):

  • Without this commit (only a8ce3153d38f applied): test 04254 FAILS with leaked <Error> ... Code: 254. DB::Exception: No replicas of database ... (NO_ACTIVE_REPLICAS) on client stderr.
  • With this commit: test 04254 PASSES, 30/30 runs with --test-runs 30 (full settings randomization). Existing test 04252 also passes 20/20.

CI starting now.

@tuanpach tuanpach self-assigned this May 21, 2026
@tuanpach

Copy link
Copy Markdown
Member

@groeneai Check the CI failure and address the unresolved comments.

groeneai added a commit to groeneai/ClickHouse that referenced this pull request May 21, 2026
`NO_ACTIVE_REPLICAS` is `Code: 254`, not `Code: 279` (which is
`ALL_CONNECTION_TRIES_FAILED`). Updates two comment references in
`tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh`
so the explanation matches the exception branch actually exercised.

Comment-only change. No behavior change.

Addresses ClickHouse#105149 (comment)...
@groeneai

Copy link
Copy Markdown
Contributor Author

Hi @tuanpach, thanks for the ping. Status update:

Bot's inline comment (254 vs 279): addressed in commit 6e35b7a9ad81. Two comment-only updates in tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh so the explanation matches the exception branch (NO_ACTIVE_REPLICAS is Code: 254, not Code: 279).

CI failure analysis (commit ec409aeec3ef): only one failing job, and it's a pre-existing chronic flake unrelated to this PR.

Job Test Classification
Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel) 02180_group_by_lowcardinality Chronic flake on this exact variant. CIDB shows 148 distinct PRs hit it in the last 30 days (20 master + 165 PR failures), and 184 of 185 (99.5%) of all failures concentrate on the same amd_llvm_coverage, ParallelReplicas, s3 storage, parallel check_name. The diff is one stray {"val":null,"avg(toUInt32(val))":null} row at the top and a missing {"val":"232.4",...} row at the bottom -- a row-order non-determinism under parallel-replica result merging, which the test does not pin. Completely orthogonal to DatabaseReplicated::tryGetCluster log levels.

All other 200+ jobs pass: builds (all archs), Fast test, Style check, Stateless tests (debug/asan/tsan/msan/s3/distributed/old-analyzer/etc.), Integration tests (all 6 shards on each sanitizer), Stress tests, AST fuzzer, BuzzHouse, Unit tests, Upgrade check, libFuzzer, Code Review.

CIDB query for reference:

SELECT check_name, count() AS hits
FROM default.checks
WHERE test_name = '02180_group_by_lowcardinality'
  AND test_status IN ('FAIL','ERROR')
  AND check_start_time > now() - INTERVAL 30 DAY
GROUP BY check_name ORDER BY hits DESC;
-- Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel)   184
-- Fast test                                                                       1

I will track 02180_group_by_lowcardinality separately as a chronic flake; not a blocker for this PR.

Ready for review on top of the new commit. Let me know if you want anything else changed before merge.

@groeneai

Copy link
Copy Markdown
Contributor Author

Quick update on the CI failure: I missed that 02180_group_by_lowcardinality was already stabilized by merged PR #105194 (2026-05-18 08:08 UTC). The CI on ec409aeec3ef failed because the branch's last master sync was 79a216415 from 2026-05-17 21:10 UTC -- 11 hours before the fix landed.

Pushed a master merge as commit 6d379a21ad29. New CI is rerunning. After CI green, ready for your merge.

groeneai added a commit to groeneai/ClickHouse that referenced this pull request May 26, 2026
`clickhouse-gh[bot]` review on PR ClickHouse#105149 noted that the existing
regression tests (04252, 04254) use `SELECT count() FROM system.clusters`,
which only reads metadata and never invokes the lambda that calls
`DatabaseReplicated::tryGetReplicasInfo`. `StorageSystemClusters::writeCluster`
only calls `replicas_info_getter` when the query reads one of the replica
state columns: `is_active`, `unsynced_after_recovery`, `replication_lag`,
`recovery_time`, or `is_shared_catalog_cluster`. So the `tryGetReplicasInfo`
log-level change in this PR had no test coverage.

Add `04278_database_replicated_system_clusters_replicas_info.sh` which:

1. Creates a `Replicated` database (so `/replicas` is populated and
   `tryGetCluster` returns a valid cluster on subsequent calls).
2. Primes `tryGetCluster` so the cluster is cached.
3. Deletes `${ZK_PATH}/max_log_ptr` so the `tryGet(paths)` in
   `tryGetReplicasInfo` returns `ZNONODE` for `paths[0]` and the function
   throws `Coordination::Exception(ZNONODE)`, which maps to
   `KEEPER_EXCEPTION` (`Code: 999`).
4. Queries `SELECT is_active FROM system.clusters` at the default
   `send_logs_level=warning` and asserts no `DatabaseReplicated.*Code:`
   leaks to client stderr.

Verified bidirectionally on local debug build:
- Without the fix (catch reverted to bare `tryLogCurrentException(log)`)
  the test FAILS with the exact expected `<Error> DatabaseReplicated ...
  Code: 999. Coordination::Exception: Coordination error: No node.
  (KEEPER_EXCEPTION)` line leaking to client stderr.
- With the fix the test PASSES, 10/10 runs with full settings
  randomization, alongside 04252 and 04254 (30/30 across all three).
groeneai added 5 commits June 23, 2026 02:46
…ception

`DatabaseReplicated::tryGetCluster` (and its siblings
`tryGetAllGroupsCluster` and `tryGetReplicasInfo`) catch exceptions
thrown while reading the database's `/replicas` znode and return nullptr
so callers (e.g. `system.clusters` filling) silently skip the database.
The catch block was logging the swallowed exception via
`tryLogCurrentException(log)`, which defaults to `LogsLevel::error`.

In CI the test runner sets `send_logs_level=warning`, so any exception
logged at error level is forwarded to the client stderr. When two
parallel stateless tests are creating/dropping their Replicated test
databases, an unrelated `SHOW CLUSTERS` (e.g. `01293_show_clusters`)
that happens to iterate over the in-flight Replicated database picks
up the now-leaked `Code: 999. Coordination::Exception: No node, path
/clickhouse/databases/.../replicas` or `Code: 279. No active replicas`
in its stderr and is reported as failed -- even though every assertion
passes and the auto-rerun budget gets 73/73 passes.

Lower the level to `LogsLevel::information` with a short context
message in all three catches. The message is still visible in normal
server logs (information is above the production logger default), but
it is below the `warning` threshold so it no longer leaks into client
stderr. This matches the pattern already used elsewhere in the file
(e.g. line 2168, `Async loading failed` logged at `warning`).

A new stateless regression test
`04252_database_replicated_system_clusters_log_level.sh` reproduces
the failure path deterministically: it creates a Replicated database,
deletes its `/replicas` znode via `keeper-client`, queries
`system.clusters` at `send_logs_level=warning`, and asserts that no
`DatabaseReplicated.*Code: (279|999)` text leaks into the captured
client stderr.

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105047&sha=dfa4712b3435d65558081aeb9ce3bc19f6d84a07&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29
…trap cleanup

Per @clickhouse-gh[bot] review feedback on PR ClickHouse#105149:

1. `src/Databases/DatabaseReplicated.cpp`: previously the `catch (...)` blocks
   in `tryGetCluster`, `tryGetAllGroupsCluster`, and `tryGetReplicasInfo`
   downgraded *every* swallowed exception to `LogsLevel::information`. That
   includes unexpected exceptions from `getClusterImpl` (malformed Keeper
   payloads, logic bugs, ...) which we still want operators to notice.
   Restrict the downgrade to the expected coordination errors
   (`KEEPER_EXCEPTION` / `ALL_CONNECTION_TRIES_FAILED`); anything else
   stays at the default `error` level via the plain `tryLogCurrentException(log)`.

2. `tests/queries/0_stateless/04252_database_replicated_system_clusters_log_level.sh`:
   move the database / metadata / Keeper cleanup into a `trap cleanup EXIT`
   handler. The test deliberately leaves the `Replicated` database in a
   non-droppable state (it deletes `${ZK_PATH}/replicas`), so without the
   trap a single `exit 1` on assertion failure would leave the orphan
   attached and the next run's `CREATE DATABASE` would fail.
`getClusterImpl` throws `ErrorCodes::NO_ACTIVE_REPLICAS` (code 254) when
the `/replicas` znode exists but is empty -- i.e. the first replica is
not fully created yet or the last replica was just dropped. This is the
same concurrent-lifecycle window the previous commit (a8ce315)
already handles for `KEEPER_EXCEPTION` (999) and
`ALL_CONNECTION_TRIES_FAILED` (279), so it should be logged at
`<Information>` for the same reason: the catch returns nullptr to the
caller (which skips the database) and the noisy `<Error>` log line
otherwise leaks to client stderr at the default
`send_logs_level=warning` and turns `system.clusters` queries flaky.

Apply the same predicate update to both `tryGetCluster` and
`tryGetAllGroupsCluster` (both reach `getClusterImpl`).
`tryGetReplicasInfo` does not call `getClusterImpl` and cannot throw
`NO_ACTIVE_REPLICAS`; its catch is left unchanged.

Companion regression test 04254 mirrors 04252's setup but recreates
`/replicas` as an empty znode (`rmr` + `touch`) so the next
`getClusterImpl` call hits the `NO_ACTIVE_REPLICAS` branch deterministically.
Verified locally: without this commit the test FAILS with
"Code: 254. DB::Exception: No replicas of database ... (NO_ACTIVE_REPLICAS)"
leaked to client stderr; with this commit the test PASSES (30 runs with
randomization enabled, 30/30 OK).
`NO_ACTIVE_REPLICAS` is `Code: 254`, not `Code: 279` (which is
`ALL_CONNECTION_TRIES_FAILED`). Updates two comment references in
`tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh`
so the explanation matches the exception branch actually exercised.

Comment-only change. No behavior change.

Addresses ClickHouse#105149 (comment)...
`clickhouse-gh[bot]` review on PR ClickHouse#105149 noted that the existing
regression tests (04252, 04254) use `SELECT count() FROM system.clusters`,
which only reads metadata and never invokes the lambda that calls
`DatabaseReplicated::tryGetReplicasInfo`. `StorageSystemClusters::writeCluster`
only calls `replicas_info_getter` when the query reads one of the replica
state columns: `is_active`, `unsynced_after_recovery`, `replication_lag`,
`recovery_time`, or `is_shared_catalog_cluster`. So the `tryGetReplicasInfo`
log-level change in this PR had no test coverage.

Add `04278_database_replicated_system_clusters_replicas_info.sh` which:

1. Creates a `Replicated` database (so `/replicas` is populated and
   `tryGetCluster` returns a valid cluster on subsequent calls).
2. Primes `tryGetCluster` so the cluster is cached.
3. Deletes `${ZK_PATH}/max_log_ptr` so the `tryGet(paths)` in
   `tryGetReplicasInfo` returns `ZNONODE` for `paths[0]` and the function
   throws `Coordination::Exception(ZNONODE)`, which maps to
   `KEEPER_EXCEPTION` (`Code: 999`).
4. Queries `SELECT is_active FROM system.clusters` at the default
   `send_logs_level=warning` and asserts no `DatabaseReplicated.*Code:`
   leaks to client stderr.

Verified bidirectionally on local debug build:
- Without the fix (catch reverted to bare `tryLogCurrentException(log)`)
  the test FAILS with the exact expected `<Error> DatabaseReplicated ...
  Code: 999. Coordination::Exception: Coordination error: No node.
  (KEEPER_EXCEPTION)` line leaking to client stderr.
- With the fix the test PASSES, 10/10 runs with full settings
  randomization, alongside 04252 and 04254 (30/30 across all three).
@groeneai groeneai force-pushed the groeneai/fix-01293-show-clusters-stderr-noise branch from 1a1eb1e to e86e6f6 Compare June 23, 2026 02:48
@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on e86e6f6dd00 (rebased onto current master). The red checks are pre-existing, not introduced by this log-level change:

  • Stress test (amd_debug / arm_asan_ubsan / arm_msan) "Hung check failed, possible deadlock found": chronic trunk-wide stress flake, ~289 master hits / 30d.
    Functional and Mergeable checks are green.

@clickhouse-gh

clickhouse-gh Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Dear @tuanpach, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — e86e6f6

CI has fully finished (Finish Workflow passed; last checks completed 2026-06-23). Every red check has an owner below; only CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
Stress test (amd_debug) / Hung check failed, possible deadlock found deadlock (chronic trunk: 976 hits / 433 PRs / 327 master in 30d — unrelated to this PR's DatabaseReplicated log-level change) #105905 / #101680 (ours, open)
Stress test (arm_asan_ubsan) / Hung check failed, possible deadlock found deadlock (same family) #105905 / #101680 (ours, open)
Stress test (arm_msan) / Hung check failed, possible deadlock found deadlock (same family) #105905 / #101680 (ours, open)
PR rollup of the Stress hung-check failures above (no separate cause)

This PR changes only DatabaseReplicated::tryGetCluster/tryGetAllGroupsCluster/tryGetReplicasInfo catch log levels + adds stateless regression tests — none of which touch the thread-pool/blob-removal shutdown paths the hung check exercises. The three Stress hung-check failures belong to the chronic shutdown-deadlock family already being addressed by #105905 and #101680. Ready for review.

Session id: cron:our-pr-ci-monitor:20260625-170000

alexey-milovidov and others added 2 commits June 28, 2026 23:01
tryGetReplicasInfo read `ddl_worker` for `unsynced_after_recovery` outside
the `ddl_worker_mutex` block (the lock guarding the adjacent `recovery_time`
read ended one statement earlier). `shutdown()` resets and frees `ddl_worker`
under `ddl_worker_mutex`, so a `SELECT is_active/unsynced_after_recovery/...
FROM system.clusters` running concurrently with `DROP/DETACH DATABASE` could
dereference the freed worker -> heap-use-after-free.

ThreadSanitizer caught this on PR ClickHouse#105149's flaky check (TSan): write of size 8
in `~DatabaseReplicatedDDLWorker` (DatabaseReplicated::shutdown, ddl_worker =
nullptr) vs read in `DatabaseReplicatedDDLWorker::isUnsyncedAfterRecovery`
reached from `StorageSystemClusters::fillData` -> `tryGetReplicasInfo`.

Move the `isUnsyncedAfterRecovery()` read into the same `ddl_worker_mutex`
block as the `recovery_time` read. Behavior is unchanged: both members are
read only when `ddl_worker` is non-null; `unsynced_after_recovery` stays false
when it is null. All writes to `ddl_worker` already hold `ddl_worker_mutex`,
so the read is now race-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh clickhouse-gh Bot added pr-bugfix Pull request with bugfix, not backported by default and removed pr-ci labels Jun 29, 2026
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate for the new commit (e2317cabc0d, heap-use-after-free fix on ddl_worker in tryGetReplicasInfo):

Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? The race window is between releasing ddl_worker_mutex and the unlocked ddl_worker->isUnsyncedAfterRecovery() read. Confirmed locally that a SELECT is_active/unsynced_after_recovery/recovery_time FROM system.clusters reaches that exact unlocked access (verified with an env-gated park: the reader sits at the access for ~5s). The race itself was caught deterministically by this PR's amd_tsan flaky check (STID 5804-51e6).
b Root cause explained? tryGetReplicasInfo reads recovery_time under ddl_worker_mutex, but the very next line reads ddl_worker again for unsynced_after_recovery after the lock_guard scope has ended. shutdown() does ddl_worker = nullptr (frees ~DatabaseReplicatedDDLWorker) under ddl_worker_mutex. A concurrent DROP/DETACH DATABASE therefore frees the worker while the unlocked read dereferences it: heap-use-after-free.
c Fix matches root cause? Yes. The fix moves the isUnsyncedAfterRecovery() read into the same ddl_worker_mutex block as recovery_time, so every access to ddl_worker holds the lock that guards its reset/free. No symptom-guarding.
d Test intent preserved / new tests added? Existing 04252/04254/04278 unchanged and still pass. No new flaky concurrency test is added: a TSan-race test that races CREATE/DROP vs system.clusters would itself be non-deterministic (and the flaky check runs it 50x). The tryGetReplicasInfo path is already exercised functionally by 04278; the thread-safety fix is validated by TSan CI + the local TSan workload below.
e Both directions demonstrated? Without fix: amd_tsan flaky check reported the data race (write in ~DatabaseReplicatedDDLWorker via shutdown() vs read in isUnsyncedAfterRecovery via tryGetReplicasInfo:588). With fix: rebuilt the amd_tsan binary (Build ID changed) and ran the same create / concurrent-system.clusters-read / DROP workload repeatedly under TSan: no race, no deadlock; 04252/04254/04278 all pass.
f Fix is general across code paths? Audited all ddl_worker dereferences in DatabaseReplicated.cpp. Line 588 (now fixed) was the only system.clusters-reachable read performed outside ddl_worker_mutex; all other accesses are already under the mutex, in debug-only chassert, or in enqueue/wait paths that the design does not run concurrently with shutdown. No sibling unlocked read remains.
g Fix generalizes across inputs? N/A for input types - this is a concurrency/lifetime fix, not a value/type-dependent code path. The race does not depend on column types or settings; it depends only on the concurrent DROP/DETACH timing, which the fix closes unconditionally.
h Backward compatible? Yes. No setting, on-disk/wire/replication format, or default changes. Pure in-process locking change; behavior of the produced ReplicasInfo is identical.
i Invariants and contracts preserved? Yes. The invariant "every read/write of ddl_worker is performed under ddl_worker_mutex" now holds on this path too. Semantics preserved: both members read only when ddl_worker is non-null; unsynced_after_recovery stays false when null (same as the old short-circuit ddl_worker && ddl_worker->isUnsyncedAfterRecovery()). Lock scope is only widened by one read of the already-held object, no new lock ordering introduced.

Session id: cron:clickhouse-worker-slot-1:20260629-003300

@clickhouse-gh

clickhouse-gh Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 25/35 (71.43%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good.

@alexey-milovidov alexey-milovidov self-assigned this Jun 29, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 29, 2026
Merged via the queue into ClickHouse:master with commit 6802478 Jun 29, 2026
174 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants