Fix stuck shutdown in server and local and join all global pool threads in all binaries by mstetsyuk · Pull Request #107291 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix stuck shutdown in server and local and join all global pool threads in all binaries#107291

Merged
mstetsyuk merged 2 commits into
masterfrom
static-thread-pool-shutdownall
Jun 17, 2026
Merged

Fix stuck shutdown in server and local and join all global pool threads in all binaries#107291
mstetsyuk merged 2 commits into
masterfrom
static-thread-pool-shutdownall

Conversation

@mstetsyuk

@mstetsyuk mstetsyuk commented Jun 11, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix the stuck shutdown bug on server and local due to static thread pools keeping idle threads indefinitely if max_*_thread_pool_free_size > 0.

Join threads from the global thread pool before exiting from main in all binaries. As a result, all static object accesses by threads happen-before static object destruction. Currently, we only do it for the server and keeper binaries.

Version info

  • Merged into: 26.6.1.924

@clickhouse-gh

clickhouse-gh Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 11, 2026
@mstetsyuk

Copy link
Copy Markdown
Member Author

Joining threads before exiting from main is required to avoid such issues.

{
static StaticThreadPool instance("FormatParsingThreadPool", CurrentMetrics::FormatParsingThreads, CurrentMetrics::FormatParsingThreadsActive, CurrentMetrics::FormatParsingThreadsScheduled);
return instance;
#define SHUTDOWN_STATIC_THREAD_POOL(SUFFIX, NAME, METRIC) get##SUFFIX##ThreadPool().shutdown();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ This lifetime fix is still incomplete because shutdownAll is only called from a hand-picked subset of mainEntry functions. Several registered binaries can still create GlobalThreadPool workers and then return without GlobalThreadPool::shutdown: mainEntryClickHouseClient reaches ClientBase::initOutputFormat -> getOutputFormatParallelIfPossible -> ParallelFormattingOutputFormat/ThreadFromGlobalPool; mainEntryClickHouseKeeperClient, mainEntryClickHouseZooKeeperDumpTree, mainEntryClickHouseZooKeeperRemoveByList, and mainEntryClickHouseExtractFromConfig create zkutil::ZooKeeper, whose constructor starts send/receive ThreadFromGlobalPool threads; mainEntryClickHouseKeeperUtils can do the same through parallel output.

Those idle global-pool OS threads are then still joined from the static GlobalThreadPool destructor, so the static-destruction race this PR fixes for server/local remains in those modes. Please either install one top-level scope guard in programs/main.cpp and programs/keeper/keeper_main.cpp, or add equivalent guards to every mainEntry that can reach GlobalThreadPool.

@azat azat left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Do you have a stacktrace as an example?

@mstetsyuk

Copy link
Copy Markdown
Member Author

Do you have a stacktrace as an example?

Of getting stuck? I added a regression test that reproduces it.

@azat

azat commented Jun 12, 2026

Copy link
Copy Markdown
Member

Of getting stuck? I added a regression test that reproduces it.

Yes, but I just wanted to look at the two stacktraces w/o running test manually

@clickhouse-gh

clickhouse-gh Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.70% 84.70% +0.00%
Functions 92.40% 92.40% +0.00%
Branches 77.30% 77.30% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 73/87 (83.91%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code

Full report · Diff report

@mstetsyuk

Copy link
Copy Markdown
Member Author

@azat, all threads stack traces: https://pastila.clickhouse.com/?008e7a5e/642ef6f364286499e10b06644c222b91#lABD0RXIc2wNqXKpHuds+w==GCM

TLDR: jemalloc threads, main thread waiting on join, one thread stuck in the parent_pool.new_job_or_shutdown.wait call

@clickhouse-gh

clickhouse-gh Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR only changes process-shutdown ordering — it destroys the static thread pools before joining the global thread pool, plus a macro refactor of the pool getters. That code runs at teardown, not during query execution, so it cannot plausibly speed up any SELECT. The four flagged improvements (clickbench Q22/Q23, TPC-H Q7/Q20) all sit off the changed path and several show noisy measurements, so we read them as run-to-run variance rather than a real effect and downgraded them to not-sure. No actionable performance change is attributable to this PR.

clickbench

⚠️ 2 inconclusive

Flagged queries (2 of 43)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
⚠️ 22 not_sure 334 256 -23.4% <0.0001 PR only reorders thread-pool teardown at process shutdown; cannot affect query execution. Source CV ~78% — noise.
⚠️ 23 not_sure 297 202 -32.0% <0.0001 Shutdown-ordering-only change; nothing here touches the SELECT hot path. -32% is off-path, not a PR effect.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

⚠️ 2 inconclusive

Flagged queries (2 of 22)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
⚠️ 7 not_sure 113 70 -38.0% <0.0001 PR adds static thread-pool shutdown at teardown only; unrelated to query path. History is noisy too.
⚠️ 20 not_sure 248 120 -51.6% <0.0001 Off-path: teardown-only diff. Source CV ~36%, base ~88%, noisy history — variance, not PR-driven.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

Debug info
  • StressHouse run: 6db4406e-598d-4d0e-b962-7b964a9ad6d4
  • MIRAI run: 9106e8dc-fca4-4720-8165-22de1aad2370
  • PR check IDs:
    • clickbench_299448_1781675416
    • clickbench_299573_1781675452
    • clickbench_299721_1781675676
    • tpch_adapted_1_official_299741_1781675677
    • tpch_adapted_1_official_300017_1781676167
    • tpch_adapted_1_official_300104_1781676202

@mstetsyuk mstetsyuk enabled auto-merge June 17, 2026 13:44
@mstetsyuk mstetsyuk added this pull request to the merge queue Jun 17, 2026
Merged via the queue into master with commit eae0a52 Jun 17, 2026
166 checks passed
@mstetsyuk mstetsyuk deleted the static-thread-pool-shutdownall branch June 17, 2026 14:44
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 17, 2026
jaymebrd added a commit to jaymebrd/ClickHouse that referenced this pull request Jun 24, 2026
PR ClickHouse#107291 introduced StaticThreadPool::shutdownAll() and placed it
before GlobalThreadPool::instance().shutdown() in Server::main's
SCOPE_EXIT. Background threads (notably SystemLog flush) run on the
global pool and can lazily call into static pools — e.g. via
SystemLog::prepareTable -> StorageMergeTree ctor -> loadDataParts ->
getActivePartsLoadingThreadPool().get(). Tearing down the static pools
first races those callers, producing
"The MergeTreePartsLoaderThreadPool is not initialized" and TSan races
on the underlying unique_ptr. Swap the order so background threads
drain first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants