Speed up server shutdown for databases with many tables#104969
Speed up server shutdown for databases with many tables#104969jaymebrd wants to merge 13 commits into
Conversation
|
@groeneai can you check what prevents merging and help? |
|
📊 Cloud Performance Report
This is a sync PR mirroring public #104969; both diffs are identical and only add a parallel table-shutdown thread pool that runs during server stop, never during query execution. Because the query hot path is untouched, none of the flagged ClickBench/TPC-H deltas — improvements or regressions — can be attributed to this change, so 8 of them are downgraded as off-path variance. Two TPC-H regressions (Q7 +800%, Q8 +30%) are large and consistent enough to trip the hard-trust override and are retained, but given the shutdown-only scope they most likely reflect environment/measurement noise and warrant a manual look rather than code review. clickbenchFlagged queries (6 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🔴 2 regressed · Flagged queries (4 of 22)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. Debug info
|
|
@serxa diagnosis for #104969 (head Everything else is green:
Nothing on the author's side looks blocking. Happy to dig further once the sync logs are visible. |
`AddDefaultDatabaseVisitor::visit(ASTSelectQuery &, ASTPtr &)` guards its WITH-traversal on `select.recursive_with` but then dereferences `select.with()->children` unconditionally. The parser sets the flag together with a non-empty WITH expression, but AST mutation after parsing (server-side `BuzzHouse` fuzzing, in particular) can leave the flag set while the expression is absent, producing a null pointer dereference inside `IStorageCluster::read`. Surfaced by `BuzzHouse (amd_tsan)` as `STID: 3440-3393` on PRs ClickHouse#104969 and ClickHouse#105332 (two unrelated PRs in five days). Guard the dereference with the actual presence of the WITH expression, matching the pattern already used in `ASTSelectQuery::formatImpl` and `QueryTreeBuilder::buildSelectQuery`. Valid `WITH RECURSIVE x AS ...` queries are unaffected. Adds a `gtest` that constructs the inconsistent AST shape, runs the visitor, and asserts no fatal failure; the test reliably aborts on the previous code with the `boost::intrusive_ptr` null-deref assertion. Closes ClickHouse#105370 CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105332&sha=ac51d1463d5def7f7ba0354f581e078cbf10d6c6&name_0=PR&name_1=BuzzHouse%20%28amd_tsan%29 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Hi @serxa, gentle ping. PR #104969 is green on public CI and all review threads are resolved; the only thing still blocking merge is the internal |
|
@groeneai Let's merge master into this branch to take new fixes into account and into sync PR |
|
@serxa branch is ~3.9k commits behind master but @jaymebrd could you merge master into |
|
@groeneai check recent CI failures? do we have fixes for them? are they merged in master? |
|
@serxa Triaged the 5 public failures at head 1. Integration tests (4 shards: asan_ubsan / msan / tsan / arm_binary) - 2. BuzzHouse (amd_debug) - 3. CH Inc sync - internal check, no public target_url; only CH Inc can inspect or re-run (unchanged from prior runs). Recommendation: nothing on the author side blocks merge. The shutdown code is not implicated by any of the five. Re-running BuzzHouse + the Integration shards should clear them; the only real gate remains the internal sync. Public CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104969&sha=39a0842660da13a14f507feaa9d2f9f6a3edaf24&name_0=PR |
|
@groeneai Failed test test_format_cannot_allocate_thread/test.py::test_it looks suspicious. Is it related to the changes in this PR? |
|
@serxa Yes, it looks related to this PR. I need to correct my 06-18 triage: I lumped it in with a pre-existing teardown race, but the hard signature on the current runs is specific to this branch. Evidence:
Mechanism: a SystemLog flush thread lazily CREATES its backing MergeTree table during shutdown; the So it is a latent shutdown-ordering fragility (system logs can lazily create their tables during shutdown and reach a torn-down static pool), and this PR's parallel shutdown is the first thing to surface it. Possible directions: skip lazy system-log table creation when the catalog is shutting down (just drop the unflushed entries, since the table was never materialized), or make sure the new parallel-shutdown work is fully joined before system logs flush so |
|
@groeneai Shutting down multiple interdependent pools is tricky. In easy cases, it could be resolved just by the proper ordering. Try to prepare the fix in a separate PR before we can merge this PR. The goal is to add necessary dependencies between startup and shutdown procedures, such that issue you found here is resolved. I hope you will not face circular dependencies; be careful with that (if nothing can prevent them, report back here) |
|
@serxa Will do. Preparing the fix in a separate PR off master. Plan: make SystemLogs teardown happen-before the static ActivePartsLoading pool teardown on all paths. Concretely, in No circular dependency expected: system logs depend on the static pools and on the |
741fbd6 to
9d7ff6e
Compare
|
@serxa prepared the fix in a separate PR off master: #108417. No circular dependency. Two parts: |
|
@groeneai fix has been merged, update this PR please |
|
@serxa #108417 is merged to master (commit e5a7294, 2026-06-28 16:17 UTC), so the shutdown-crash fix is now on master. Two things about updating this PR: 1. The branch update has to be done by @jaymebrd or a CH Inc maintainer. I'm an external contributor with no write access to jaymebrd's fork, and "Allow edits from maintainers" only grants push to CH Inc maintainers, not to other external contributors. 2. It is now a real conflict, not a clean "Update branch". #108417 touched the same two files this PR does:
So merging master needs manual conflict resolution in those two files. 3. On the clickhouse-gh[bot] shutdown-ordering finding: the system-log-race half (the One thing to carry forward when resolving the |
DatabaseWithOwnTablesBase ran shutdown sequentially per table, which made server shutdown take many minutes on clusters with thousands of tables. Fan flushAndPrepareForShutdown and flushAndShutdown across a dedicated thread pool (database_catalog_shutdown_table_concurrency, default 16) and log per-database elapsed time. The parallel path is exercised by a new integration test gated behind a test-only failpoint.
The lambda captured db_name and database_uuid only to assert a database-level invariant. Under NDEBUG the assert is a no-op, so clang flagged both as unused captures and -Werror failed Fast test. Move the assertion out of the per-table lambda — it only needs to run once per database — and drop the unused captures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DatabaseWithOwnTablesBase is also used by Memory-backed databases like INFORMATION_SCHEMA and system in clickhouse-local, where getUUID() is Nil but the snapshot is not empty. The previous unconditional chassert would trip on those in debug/ASan builds. Gate the check on whether the tables themselves carry UUIDs, mirroring the original assertion that was guarded by table_id.hasUUID(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If the setting is 0, the pool accepts tasks but has no worker to run them, so DatabaseWithOwnTablesBase::shutdown blocks forever on waitForAllToFinishAndRethrowFirstError. Bump to 1 at init time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the predefined-database check into the fiu_do_on block so the test-only logic lives inside the failpoint scope rather than being hoisted into shutdown()'s top-level variables. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9d7ff6e to
b36e6bb
Compare
enqueueAndKeepTrack can throw before the task's try/catch runs, skipping per-table UUID mapping removal and the tail cleanup. Run the task inline on schedule failure.
Both can throw (removeUUIDMapping has an explicit LOGICAL_ERROR throw if the mapping is already gone). When they're outside the inner try, the throw escapes into the task future, which waitForAllToFinish discards. Move them inside the try so record_error captures them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move getStorageID, flushAndShutdown, and removeUUIDMapping into three separate try blocks. UUID mapping cleanup must run even when flushAndShutdown throws, otherwise the catalog keeps the UUID->StoragePtr mapping past Context::shutdown and the storage is destroyed only at process exit (after loggers/static pools are gone). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Capture shared error state as shared_ptr<ErrorRecorder> and the logger by value, dropping `this` and by-ref captures. An orphan task — rare case where the pool accepts the job but tasks.emplace_back throws — can outlive shutdown() and would otherwise dangle into the stack frame. Also read recorder->first_error under the recorder's mutex in the tail, since an orphan task can still write to it after waitForAllToFinish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
enqueueAndKeepTrack schedules the task, then tracks it with tasks.emplace_back. If that emplace_back reallocated and threw after the task was already scheduled, the catch ran the task inline a second time while the scheduled copy stayed untracked (and unwaited). Reserve the tracking vector up front so emplace_back never reallocates; then a throw from enqueueAndKeepTrack always means the task was never scheduled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Parallelizing flushAndPrepareForShutdown was unsafe: some tables flush into other tables there (a Buffer table writes to its destination), and running that concurrently could hit a destination already in shutdown-prepared read-only mode and silently drop rows. Keep the prepare phase sequential (unchanged from master), and parallelize only the raw shutdown() phase — the expensive, table-independent drain (ZooKeeper sessions, interserver endpoints, background pools) that was the actual bottleneck for databases with many tables. Since prepare already ran, call shutdown() directly instead of flushAndShutdown() to avoid re-running the flush. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 105/110 (95.45%) · Uncovered code |
serxa
left a comment
There was a problem hiding this comment.
The rework feels okay to me, but now I think we may need one more opinion on this.
@evillique PTAL, maybe we need something similar in SharedCatalog, or that's completely unrelated/orthogonal?

Table shutdown runs sequentially per table, which can make server shutdown take many minutes to 1hr+ on clusters with many thousands of tables.
This PR parallelises
DatabaseWithOwnTablesBase::shutdown:src/Databases/DatabasesCommon.cpp: runflushAndPrepareForShutdownandflushAndShutdownacross tables concurrently via a dedicated thread pool. Logs per-database elapsed time.The speedup is best measured by SIGTERM-to-exit wall time on many-table clusters.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Speed up server shutdown for databases with many tables by running per-table shutdown in parallel. Controlled by the new database_catalog_shutdown_table_concurrency setting (default 16).