Fix broken-promise server abort in threadPoolCallbackRunnerUnsafe on shutdown by groeneai · Pull Request #107383 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix broken-promise server abort in threadPoolCallbackRunnerUnsafe on shutdown#107383

Merged
alexey-milovidov merged 5 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-2508-38c6-callbackrunner-broken-promise
Jun 28, 2026
Merged

Fix broken-promise server abort in threadPoolCallbackRunnerUnsafe on shutdown#107383
alexey-milovidov merged 5 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-2508-38c6-callbackrunner-broken-promise

Conversation

@groeneai

@groeneai groeneai commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a rare server abort during shutdown (std::future_error, "The associated promise has been destructed prior to the associated state becoming ready") caused by a task scheduled through threadPoolCallbackRunnerUnsafe being dropped from the thread pool queue before it ran.

Description

threadPoolCallbackRunnerUnsafe wrapped work in a std::packaged_task and scheduled it on a ThreadPool. When the pool is shut down (ThreadPool::finalize sets shutdown and the worker drains the queue with job_data.reset()), a still-queued task is destroyed without running. The bare packaged_task then destructs its std::promise with no stored value, so ~promise() stores a broken-promise std::future_error.

std::future_error derives from std::logic_error, so when a waiter observes it via future.get() the server reports it through getCurrentExceptionMessageAndPattern, which calls abortOnFailedAssertion on any std::logic_error in debug/sanitizer builds, aborting the process during shutdown.

Observed in CI as a Stress test / serverfuzz abort (STID 2508-38c6): a TaskTracker flush task scheduled through threadPoolCallbackRunnerUnsafe (async WriteBufferFromS3 path) was dropped on shutdown, and the waiting merge task (MergePlainMergeTreeTask::executeStep) aborted.

Report (arm_tsan): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105248&sha=c8f5c4e0660ad27d0509192338a1e38e4dcad792&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29

Fix: replace the packaged_task with a small runnable that owns its std::promise and, when destroyed without having run, satisfies the promise with a normal DB::Exception (CANNOT_SCHEDULE_TASK) instead of leaving it broken. DB::Exception is not a std::logic_error, so the waiter now gets an ordinary, catchable error and the server no longer aborts. This mirrors the explicit-promise handling already used by the sibling ThreadPoolCallbackRunnerLocal (PR #73629) and preserves the existing ThreadGroupSwitcher-before-set_value ordering and the SCOPE_EXIT_SAFE callback release.

Added a unit test (gtest_thread_pool_callback_runner_shutdown) that deterministically drops queued runner tasks on shutdown and asserts their futures never carry a broken-promise std::future_error.

No related open issue found.

Version info

  • Merged into: 26.7.1.191

…shutdown

threadPoolCallbackRunnerUnsafe wrapped work in a std::packaged_task and scheduled
it on a ThreadPool. When the pool is shut down (ThreadPool::finalize sets shutdown
and the worker drains the queue with job_data.reset()), a still-queued task is
destroyed without running. The bare packaged_task then destructs its std::promise
with no stored value, so ~promise() stores a broken-promise std::future_error.

std::future_error derives from std::logic_error, so when a waiter observes it via
future.get() the server reports it through getCurrentExceptionMessageAndPattern,
which calls abortOnFailedAssertion on any std::logic_error in debug/sanitizer
builds, aborting the process during shutdown.

Observed in CI as a Stress test / serverfuzz abort (STID 2508-38c6): a TaskTracker
flush task scheduled through threadPoolCallbackRunnerUnsafe was dropped on shutdown,
and the waiting merge task aborted with "The associated promise has been destructed
prior to the associated state becoming ready".

Replace the packaged_task with a small runnable that owns its std::promise and, when
it is destroyed without having run, satisfies the promise with a normal DB::Exception
(CANNOT_SCHEDULE_TASK) instead of leaving it broken. DB::Exception is not a
std::logic_error, so the waiter now observes an ordinary, catchable error and the
server no longer aborts. This mirrors the explicit-promise handling already used by
the sibling ThreadPoolCallbackRunnerLocal (PR ClickHouse#73629) and preserves the existing
ThreadGroupSwitcher-before-set ordering and the SCOPE_EXIT_SAFE callback release.

Add a unit test that deterministically drops queued runner tasks on shutdown and
asserts their futures never carry a broken-promise std::future_error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @Algunenano @al13n321 — could you review? threadPoolCallbackRunnerUnsafe scheduled work as a bare std::packaged_task; when the pool drops a still-queued task on shutdown (finalize → worker job_data.reset()), ~promise() stores a broken-promise std::future_error (a std::logic_error) that aborts the server via abortOnFailedAssertion when a waiter calls future.get() (CI STID 2508-38c6). The fix gives the runner an explicit promise satisfied with a normal DB::Exception on unrun-destruction, mirroring the ThreadPoolCallbackRunnerLocal handling from #73629.

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 13, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [f0799da]

Summary:


AI Review

Summary

This PR replaces the std::packaged_task used by threadPoolCallbackRunnerUnsafe with an explicit std::promise owner so tasks dropped from a shutting-down ThreadPool publish a normal DB::Exception instead of a broken-promise std::future_error. I rechecked the current PR-head code, the ThreadPool shutdown path, scheduling-failure behavior, existing threadPoolCallbackRunnerUnsafe result types, the new regression test, and the prior review threads. The earlier bot findings are addressed in the current code, and I found no remaining correctness issue that needs a new inline comment.

Final Verdict

Status: ✅ Approve

No required actions.

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 13, 2026
Comment thread src/Common/threadPoolCallbackRunner.h Outdated
The new gtest's catch (...) around pool.wait() tripped the catch_all style
check (silently swallows). pool.wait() rethrows only the blocker job's
std::runtime_error, so catch the concrete std::exception instead; anything
unexpected now propagates and fails the test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread src/Common/threadPoolCallbackRunner.h Outdated
groeneai and others added 3 commits June 13, 2026 10:34
… construction throws

Addresses review feedback on the broken-promise fix: CallbackRunnerTask's
destructor constructed the DB::Exception as the argument to make_exception_ptr,
outside any guard. Constructing a DB::Exception captures a stack trace and may
itself throw (e.g. std::bad_alloc under shutdown memory pressure); the surrounding
catch (...) then swallowed it and the std::promise destructed with no stored
value, recreating the broken-promise std::future_error this class is meant to
prevent.

Build the exception_ptr inside its own try first and, if constructing the intended
DB::Exception fails, fall back to std::current_exception() (always non-null inside
a catch handler). The promise is then unconditionally satisfied via set_exception,
so the dropped-task path can never leave it unset.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d-task path

Addresses review feedback: the dropped-task path satisfied the promise while the
callback was still a live member, only destroyed after the destructor body returned.
set_exception can wake a waiter immediately, and a waiter that treats future.get()
as "the async task is done" could then tear down owning/thread-group state that the
callback's captures still reference.

Mirror the run() path: release the callback (destroying its captures) under the
task's ThreadGroupSwitcher and before satisfying the promise, via SCOPE_EXIT_SAFE.
ThreadGroupSwitcher's constructor is noexcept and a no-op for a null or identical
group, so it is safe on the shutdown drain thread (the dropped job is destroyed
under ALLOW_ALLOCATIONS_IN_SCOPE in ThreadPool::worker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The earlier Style fix narrowed catch(...) to typed catches but left the
catch bodies empty, which the arm_tidy clang-tidy gate flags as
bugprone-empty-catch (WERROR). Handle the expected exceptions instead of
swallowing them: SUCCEED() on the blocker's rethrow, and assert the dropped
task's DB::Exception carries CANNOT_SCHEDULE_TASK. This satisfies the lint
and strengthens the test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.70% 84.70% +0.00%
Functions 92.30% 92.30% +0.00%
Branches 77.30% 77.30% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 63/67 (94.03%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code

Full report · Diff report

@groeneai

Copy link
Copy Markdown
Contributor Author

pull Bot pushed a commit to AKJUS/ClickHouse that referenced this pull request Jun 25, 2026
In Coordination::ZooKeeper::sendThread, after a request is popped from the
queue the local RequestInfo is the sole owner of its callback until the request
is inserted into operations (from which receiveEvent()/finalize() later satisfy
it). Async callers such as ZooKeeper::asyncTryExistsNoThrow capture a
shared_ptr<std::promise<...>> in that callback and hand the std::future to a
waiting caller before the request is processed.

If anything in the pop->insert window throws (the OpenTelemetry span finalize,
addRootPath, or the operations map insert, all of which can allocate and throw
under memory pressure), the local RequestInfo is destroyed while unwinding and
the callback never runs. The captured std::promise is then destroyed unsatisfied,
so the waiter's future.get() observes a std::future_error (broken promise). That
future_error is a std::logic_error, which the server reports as a LOGICAL_ERROR
and aborts on (see PR ClickHouse#51907). This matches the reported crash (STID 2508-34fb,
ZooKeeper-client variant): ~promise<ExistsResponse> running from sendThread's
teardown.

Arm a SCOPE_EXIT guard over that window: if the request is dropped before its
callback ownership transfers to operations, satisfy the callback with an error
response (ZCONNECTIONLOSS if it was probably sent, otherwise ZSESSIONEXPIRED)
instead of abandoning the promise. This mirrors how finalize() drains
outstanding operations and the already-merged sibling fixes for the same error
class (TestKeeper, PR ClickHouse#73570; the thread-pool callback runner, PR ClickHouse#107383). The
fix is at the send layer, so it covers every async request type uniformly.

Add a unit test reproducing the window contract: without the guard the future
carries a broken-promise future_error; with the guard it carries a normal
ZSESSIONEXPIRED response and nothing aborts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov self-assigned this Jun 28, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 28, 2026
Merged via the queue into ClickHouse:master with commit 5430376 Jun 28, 2026
166 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants