Fix stuck shutdown in server and local and join all global pool threads in all binaries#107291
Conversation
|
Joining threads before exiting from main is required to avoid such issues. |
| { | ||
| static StaticThreadPool instance("FormatParsingThreadPool", CurrentMetrics::FormatParsingThreads, CurrentMetrics::FormatParsingThreadsActive, CurrentMetrics::FormatParsingThreadsScheduled); | ||
| return instance; | ||
| #define SHUTDOWN_STATIC_THREAD_POOL(SUFFIX, NAME, METRIC) get##SUFFIX##ThreadPool().shutdown(); |
There was a problem hiding this comment.
shutdownAll is only called from a hand-picked subset of mainEntry functions. Several registered binaries can still create GlobalThreadPool workers and then return without GlobalThreadPool::shutdown: mainEntryClickHouseClient reaches ClientBase::initOutputFormat -> getOutputFormatParallelIfPossible -> ParallelFormattingOutputFormat/ThreadFromGlobalPool; mainEntryClickHouseKeeperClient, mainEntryClickHouseZooKeeperDumpTree, mainEntryClickHouseZooKeeperRemoveByList, and mainEntryClickHouseExtractFromConfig create zkutil::ZooKeeper, whose constructor starts send/receive ThreadFromGlobalPool threads; mainEntryClickHouseKeeperUtils can do the same through parallel output.
Those idle global-pool OS threads are then still joined from the static GlobalThreadPool destructor, so the static-destruction race this PR fixes for server/local remains in those modes. Please either install one top-level scope guard in programs/main.cpp and programs/keeper/keeper_main.cpp, or add equivalent guards to every mainEntry that can reach GlobalThreadPool.
azat
left a comment
There was a problem hiding this comment.
Looks good
Do you have a stacktrace as an example?
Of getting stuck? I added a regression test that reproduces it. |
Yes, but I just wanted to look at the two stacktraces w/o running test manually |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 73/87 (83.91%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code |
|
@azat, all threads stack traces: https://pastila.clickhouse.com/?008e7a5e/642ef6f364286499e10b06644c222b91#lABD0RXIc2wNqXKpHuds+w==GCM TLDR: jemalloc threads, main thread waiting on join, one thread stuck in the |
|
📊 Cloud Performance Report ✅ AI verdict: This PR only changes process-shutdown ordering — it destroys the static thread pools before joining the global thread pool, plus a macro refactor of the pool getters. That code runs at teardown, not during query execution, so it cannot plausibly speed up any SELECT. The four flagged improvements (clickbench Q22/Q23, TPC-H Q7/Q20) all sit off the changed path and several show noisy measurements, so we read them as run-to-run variance rather than a real effect and downgraded them to not-sure. No actionable performance change is attributable to this PR. clickbenchFlagged queries (2 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_officialFlagged queries (2 of 22)q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. Debug info
|
PR ClickHouse#107291 introduced StaticThreadPool::shutdownAll() and placed it before GlobalThreadPool::instance().shutdown() in Server::main's SCOPE_EXIT. Background threads (notably SystemLog flush) run on the global pool and can lazily call into static pools — e.g. via SystemLog::prepareTable -> StorageMergeTree ctor -> loadDataParts -> getActivePartsLoadingThreadPool().get(). Tearing down the static pools first races those callers, producing "The MergeTreePartsLoaderThreadPool is not initialized" and TSan races on the underlying unique_ptr. Swap the order so background threads drain first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix the stuck shutdown bug on server and local due to static thread pools keeping idle threads indefinitely if
max_*_thread_pool_free_size > 0.Join threads from the global thread pool before exiting from main in all binaries. As a result, all static object accesses by threads happen-before static object destruction. Currently, we only do it for the server and keeper binaries.
Version info
26.6.1.924