Fix flaky 01293_show_clusters: lower log level in DatabaseReplicated::tryGetCluster#105149
Conversation
|
cc @tavplubix @alesapin — could you review this? Three log-level changes in |
|
Workflow [PR], commit [e2317ca] Summary: ✅
AI ReviewSummaryThis PR lowers the log level for expected transient Final VerdictStatus: ✅ Approve |
…trap cleanup Per @clickhouse-gh[bot] review feedback on PR ClickHouse#105149: 1. `src/Databases/DatabaseReplicated.cpp`: previously the `catch (...)` blocks in `tryGetCluster`, `tryGetAllGroupsCluster`, and `tryGetReplicasInfo` downgraded *every* swallowed exception to `LogsLevel::information`. That includes unexpected exceptions from `getClusterImpl` (malformed Keeper payloads, logic bugs, ...) which we still want operators to notice. Restrict the downgrade to the expected coordination errors (`KEEPER_EXCEPTION` / `ALL_CONNECTION_TRIES_FAILED`); anything else stays at the default `error` level via the plain `tryLogCurrentException(log)`. 2. `tests/queries/0_stateless/04252_database_replicated_system_clusters_log_level.sh`: move the database / metadata / Keeper cleanup into a `trap cleanup EXIT` handler. The test deliberately leaves the `Replicated` database in a non-droppable state (it deletes `${ZK_PATH}/replicas`), so without the trap a single `exit 1` on assertion failure would leave the orphan attached and the next run's `CREATE DATABASE` would fail.
|
Thanks @clickhouse-gh — both review points addressed in a8ce315:
(Session: cron:clickhouse-ci-task-worker:20260517-181500) |
|
This was fixed by #105146. Let's update the branch. |
|
Addressed the bot's
New regression test Bidirectional verification (local debug build, clang-21):
CI starting now. |
|
@groeneai Check the CI failure and address the unresolved comments. |
`NO_ACTIVE_REPLICAS` is `Code: 254`, not `Code: 279` (which is `ALL_CONNECTION_TRIES_FAILED`). Updates two comment references in `tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh` so the explanation matches the exception branch actually exercised. Comment-only change. No behavior change. Addresses ClickHouse#105149 (comment)...
|
Hi @tuanpach, thanks for the ping. Status update: Bot's inline comment ( CI failure analysis (commit
All other 200+ jobs pass: builds (all archs), Fast test, Style check, Stateless tests (debug/asan/tsan/msan/s3/distributed/old-analyzer/etc.), Integration tests (all 6 shards on each sanitizer), Stress tests, AST fuzzer, BuzzHouse, Unit tests, Upgrade check, libFuzzer, Code Review. CIDB query for reference: SELECT check_name, count() AS hits
FROM default.checks
WHERE test_name = '02180_group_by_lowcardinality'
AND test_status IN ('FAIL','ERROR')
AND check_start_time > now() - INTERVAL 30 DAY
GROUP BY check_name ORDER BY hits DESC;
-- Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel) 184
-- Fast test 1I will track Ready for review on top of the new commit. Let me know if you want anything else changed before merge. |
|
Quick update on the CI failure: I missed that Pushed a master merge as commit |
`clickhouse-gh[bot]` review on PR ClickHouse#105149 noted that the existing regression tests (04252, 04254) use `SELECT count() FROM system.clusters`, which only reads metadata and never invokes the lambda that calls `DatabaseReplicated::tryGetReplicasInfo`. `StorageSystemClusters::writeCluster` only calls `replicas_info_getter` when the query reads one of the replica state columns: `is_active`, `unsynced_after_recovery`, `replication_lag`, `recovery_time`, or `is_shared_catalog_cluster`. So the `tryGetReplicasInfo` log-level change in this PR had no test coverage. Add `04278_database_replicated_system_clusters_replicas_info.sh` which: 1. Creates a `Replicated` database (so `/replicas` is populated and `tryGetCluster` returns a valid cluster on subsequent calls). 2. Primes `tryGetCluster` so the cluster is cached. 3. Deletes `${ZK_PATH}/max_log_ptr` so the `tryGet(paths)` in `tryGetReplicasInfo` returns `ZNONODE` for `paths[0]` and the function throws `Coordination::Exception(ZNONODE)`, which maps to `KEEPER_EXCEPTION` (`Code: 999`). 4. Queries `SELECT is_active FROM system.clusters` at the default `send_logs_level=warning` and asserts no `DatabaseReplicated.*Code:` leaks to client stderr. Verified bidirectionally on local debug build: - Without the fix (catch reverted to bare `tryLogCurrentException(log)`) the test FAILS with the exact expected `<Error> DatabaseReplicated ... Code: 999. Coordination::Exception: Coordination error: No node. (KEEPER_EXCEPTION)` line leaking to client stderr. - With the fix the test PASSES, 10/10 runs with full settings randomization, alongside 04252 and 04254 (30/30 across all three).
…ception `DatabaseReplicated::tryGetCluster` (and its siblings `tryGetAllGroupsCluster` and `tryGetReplicasInfo`) catch exceptions thrown while reading the database's `/replicas` znode and return nullptr so callers (e.g. `system.clusters` filling) silently skip the database. The catch block was logging the swallowed exception via `tryLogCurrentException(log)`, which defaults to `LogsLevel::error`. In CI the test runner sets `send_logs_level=warning`, so any exception logged at error level is forwarded to the client stderr. When two parallel stateless tests are creating/dropping their Replicated test databases, an unrelated `SHOW CLUSTERS` (e.g. `01293_show_clusters`) that happens to iterate over the in-flight Replicated database picks up the now-leaked `Code: 999. Coordination::Exception: No node, path /clickhouse/databases/.../replicas` or `Code: 279. No active replicas` in its stderr and is reported as failed -- even though every assertion passes and the auto-rerun budget gets 73/73 passes. Lower the level to `LogsLevel::information` with a short context message in all three catches. The message is still visible in normal server logs (information is above the production logger default), but it is below the `warning` threshold so it no longer leaks into client stderr. This matches the pattern already used elsewhere in the file (e.g. line 2168, `Async loading failed` logged at `warning`). A new stateless regression test `04252_database_replicated_system_clusters_log_level.sh` reproduces the failure path deterministically: it creates a Replicated database, deletes its `/replicas` znode via `keeper-client`, queries `system.clusters` at `send_logs_level=warning`, and asserts that no `DatabaseReplicated.*Code: (279|999)` text leaks into the captured client stderr. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105047&sha=dfa4712b3435d65558081aeb9ce3bc19f6d84a07&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29
…trap cleanup Per @clickhouse-gh[bot] review feedback on PR ClickHouse#105149: 1. `src/Databases/DatabaseReplicated.cpp`: previously the `catch (...)` blocks in `tryGetCluster`, `tryGetAllGroupsCluster`, and `tryGetReplicasInfo` downgraded *every* swallowed exception to `LogsLevel::information`. That includes unexpected exceptions from `getClusterImpl` (malformed Keeper payloads, logic bugs, ...) which we still want operators to notice. Restrict the downgrade to the expected coordination errors (`KEEPER_EXCEPTION` / `ALL_CONNECTION_TRIES_FAILED`); anything else stays at the default `error` level via the plain `tryLogCurrentException(log)`. 2. `tests/queries/0_stateless/04252_database_replicated_system_clusters_log_level.sh`: move the database / metadata / Keeper cleanup into a `trap cleanup EXIT` handler. The test deliberately leaves the `Replicated` database in a non-droppable state (it deletes `${ZK_PATH}/replicas`), so without the trap a single `exit 1` on assertion failure would leave the orphan attached and the next run's `CREATE DATABASE` would fail.
`getClusterImpl` throws `ErrorCodes::NO_ACTIVE_REPLICAS` (code 254) when the `/replicas` znode exists but is empty -- i.e. the first replica is not fully created yet or the last replica was just dropped. This is the same concurrent-lifecycle window the previous commit (a8ce315) already handles for `KEEPER_EXCEPTION` (999) and `ALL_CONNECTION_TRIES_FAILED` (279), so it should be logged at `<Information>` for the same reason: the catch returns nullptr to the caller (which skips the database) and the noisy `<Error>` log line otherwise leaks to client stderr at the default `send_logs_level=warning` and turns `system.clusters` queries flaky. Apply the same predicate update to both `tryGetCluster` and `tryGetAllGroupsCluster` (both reach `getClusterImpl`). `tryGetReplicasInfo` does not call `getClusterImpl` and cannot throw `NO_ACTIVE_REPLICAS`; its catch is left unchanged. Companion regression test 04254 mirrors 04252's setup but recreates `/replicas` as an empty znode (`rmr` + `touch`) so the next `getClusterImpl` call hits the `NO_ACTIVE_REPLICAS` branch deterministically. Verified locally: without this commit the test FAILS with "Code: 254. DB::Exception: No replicas of database ... (NO_ACTIVE_REPLICAS)" leaked to client stderr; with this commit the test PASSES (30 runs with randomization enabled, 30/30 OK).
`NO_ACTIVE_REPLICAS` is `Code: 254`, not `Code: 279` (which is `ALL_CONNECTION_TRIES_FAILED`). Updates two comment references in `tests/queries/0_stateless/04254_database_replicated_system_clusters_no_active_replicas.sh` so the explanation matches the exception branch actually exercised. Comment-only change. No behavior change. Addresses ClickHouse#105149 (comment)...
`clickhouse-gh[bot]` review on PR ClickHouse#105149 noted that the existing regression tests (04252, 04254) use `SELECT count() FROM system.clusters`, which only reads metadata and never invokes the lambda that calls `DatabaseReplicated::tryGetReplicasInfo`. `StorageSystemClusters::writeCluster` only calls `replicas_info_getter` when the query reads one of the replica state columns: `is_active`, `unsynced_after_recovery`, `replication_lag`, `recovery_time`, or `is_shared_catalog_cluster`. So the `tryGetReplicasInfo` log-level change in this PR had no test coverage. Add `04278_database_replicated_system_clusters_replicas_info.sh` which: 1. Creates a `Replicated` database (so `/replicas` is populated and `tryGetCluster` returns a valid cluster on subsequent calls). 2. Primes `tryGetCluster` so the cluster is cached. 3. Deletes `${ZK_PATH}/max_log_ptr` so the `tryGet(paths)` in `tryGetReplicasInfo` returns `ZNONODE` for `paths[0]` and the function throws `Coordination::Exception(ZNONODE)`, which maps to `KEEPER_EXCEPTION` (`Code: 999`). 4. Queries `SELECT is_active FROM system.clusters` at the default `send_logs_level=warning` and asserts no `DatabaseReplicated.*Code:` leaks to client stderr. Verified bidirectionally on local debug build: - Without the fix (catch reverted to bare `tryLogCurrentException(log)`) the test FAILS with the exact expected `<Error> DatabaseReplicated ... Code: 999. Coordination::Exception: Coordination error: No node. (KEEPER_EXCEPTION)` line leaking to client stderr. - With the fix the test PASSES, 10/10 runs with full settings randomization, alongside 04252 and 04254 (30/30 across all three).
1a1eb1e to
e86e6f6
Compare
|
CI finished on
|
|
Dear @tuanpach, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
CI finish ledger — e86e6f6CI has fully finished (Finish Workflow passed; last checks completed 2026-06-23). Every red check has an owner below; only
This PR changes only Session id: cron:our-pr-ci-monitor:20260625-170000 |
tryGetReplicasInfo read `ddl_worker` for `unsynced_after_recovery` outside the `ddl_worker_mutex` block (the lock guarding the adjacent `recovery_time` read ended one statement earlier). `shutdown()` resets and frees `ddl_worker` under `ddl_worker_mutex`, so a `SELECT is_active/unsynced_after_recovery/... FROM system.clusters` running concurrently with `DROP/DETACH DATABASE` could dereference the freed worker -> heap-use-after-free. ThreadSanitizer caught this on PR ClickHouse#105149's flaky check (TSan): write of size 8 in `~DatabaseReplicatedDDLWorker` (DatabaseReplicated::shutdown, ddl_worker = nullptr) vs read in `DatabaseReplicatedDDLWorker::isUnsyncedAfterRecovery` reached from `StorageSystemClusters::fillData` -> `tryGetReplicasInfo`. Move the `isUnsyncedAfterRecovery()` read into the same `ddl_worker_mutex` block as the `recovery_time` read. Behavior is unchanged: both members are read only when `ddl_worker` is non-null; `unsynced_after_recovery` stays false when it is null. All writes to `ddl_worker` already hold `ddl_worker_mutex`, so the read is now race-free. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pre-PR validation gate for the new commit ( Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-1:20260629-003300 |
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered by tests: 25/35 (71.43%) | Lost baseline coverage: none · Uncovered code |

Related: #67732
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a heap-use-after-free in
DatabaseReplicated::tryGetReplicasInfo(reachable by queryingsystem.clustersfor replica-state columns such asis_active/unsynced_after_recovery/recovery_timewhile aReplicateddatabase is being dropped or detached), and lowered the log level of swallowed Keeper exceptions inDatabaseReplicated::tryGetCluster,tryGetAllGroupsCluster, andtryGetReplicasInfofrom<Error>to<Information>. The exceptions are caught and ignored by design (callers such assystem.clusterstreat the affected database as transiently unavailable and skip it), but previously the error-level log line was forwarded to client stderr at the defaultsend_logs_level=warning, causing unrelated tests touchingsystem.clusters(e.g.01293_show_clusters) to appear failed on parallel stateless lanes.Description
This PR fixes two related issues in
DatabaseReplicated'ssystem.clustersread path.1. Flaky
01293_show_clusters(log level) - original PR scopeDatabaseReplicated::tryGetClusterand the siblingtryGetAllGroupsCluster/tryGetReplicasInfointentionally swallow exceptions thrown while reading the database's/replicasKeeper state. The existing comment intryGetClusterexplains why:The exception is swallowed (
clusterstays null, callers skip the database), but the catch block was logging it viatryLogCurrentException(log), which defaults toLogsLevel::error. The CI test runner forwards server logs atsend_logs_level=warningand above to the client stderr, so the swallowed exception still showed up as test stderr noise, reporting unrelatedsystem.clustersqueries (e.g.01293_show_clusters) as failed on arm_binary / amd_debug / s3 lanes.The three catches now log via
tryLogCurrentException(log, "...", LogsLevel::information)for the expected coordination/connection codes (KEEPER_EXCEPTION,ALL_CONNECTION_TRIES_FAILED,NO_ACTIVE_REPLICAS); anything else stays at the default error level so genuine problems remain visible to operators.Regression tests:
04252_database_replicated_system_clusters_log_level.sh- deletes/replicas, asserts noCode: (279|999)leaks to client stderr.04254_database_replicated_system_clusters_no_active_replicas.sh- empty/replicas, asserts noCode: 254(NO_ACTIVE_REPLICAS) leak.04278_database_replicated_system_clusters_replicas_info.sh- deletes/max_log_ptr, exercises thetryGetReplicasInfopath specifically.2. Data race / heap-use-after-free on
ddl_workerintryGetReplicasInfoThe TSan flaky check on this PR caught a real, pre-existing data race in
tryGetReplicasInfo:In
tryGetReplicasInfo, therecovery_timeread was correctly performed underddl_worker_mutex, but the immediately followingunsynced_after_recovery = ddl_worker && ddl_worker->isUnsyncedAfterRecovery()read happened outside that lock (thelock_guardscope had already ended).shutdown()resets and freesddl_workerunderddl_worker_mutex, so aSELECT is_active / unsynced_after_recovery / recovery_time FROM system.clustersrunning concurrently withDROP/DETACH DATABASEcould dereference the freed worker - a heap-use-after-free (also reachable in release builds, not just under sanitizers).The fix moves the
isUnsyncedAfterRecovery()read into the sameddl_worker_mutexblock as therecovery_timeread. Behavior is unchanged: both members are read only whenddl_workeris non-null, andunsynced_after_recoverystaysfalsewhen it is null. All writes toddl_workeralready holdddl_worker_mutex, so the read is now race-free.CI provenance for the race (this PR's flaky check):
Stateless tests (amd_tsan, flaky check), STID 5804-51e6Verification:
SELECT is_active/unsynced_after_recovery/recovery_time FROM system.clusters,DROP DATABASE, repeated under TSan) runs clean - no data race, no deadlock.ddl_workeraccess intryGetReplicasInfo(the affected line) before the fix.Related
tests/queries/0_stateless/03206_replication_lag_metric.shworked around the same stderr noise by settingCLICKHOUSE_CLIENT_SERVER_LOGS_LEVEL=none. That workaround can be revisited in a follow-up now that the underlying log-level mismatch is fixed at the source.Version info
26.7.1.276