Fix broken-promise server abort in threadPoolCallbackRunnerUnsafe on shutdown#107383
Conversation
…shutdown threadPoolCallbackRunnerUnsafe wrapped work in a std::packaged_task and scheduled it on a ThreadPool. When the pool is shut down (ThreadPool::finalize sets shutdown and the worker drains the queue with job_data.reset()), a still-queued task is destroyed without running. The bare packaged_task then destructs its std::promise with no stored value, so ~promise() stores a broken-promise std::future_error. std::future_error derives from std::logic_error, so when a waiter observes it via future.get() the server reports it through getCurrentExceptionMessageAndPattern, which calls abortOnFailedAssertion on any std::logic_error in debug/sanitizer builds, aborting the process during shutdown. Observed in CI as a Stress test / serverfuzz abort (STID 2508-38c6): a TaskTracker flush task scheduled through threadPoolCallbackRunnerUnsafe was dropped on shutdown, and the waiting merge task aborted with "The associated promise has been destructed prior to the associated state becoming ready". Replace the packaged_task with a small runnable that owns its std::promise and, when it is destroyed without having run, satisfies the promise with a normal DB::Exception (CANNOT_SCHEDULE_TASK) instead of leaving it broken. DB::Exception is not a std::logic_error, so the waiter now observes an ordinary, catchable error and the server no longer aborts. This mirrors the explicit-promise handling already used by the sibling ThreadPoolCallbackRunnerLocal (PR ClickHouse#73629) and preserves the existing ThreadGroupSwitcher-before-set ordering and the SCOPE_EXIT_SAFE callback release. Add a unit test that deterministically drops queued runner tasks on shutdown and asserts their futures never carry a broken-promise std::future_error. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @Algunenano @al13n321 — could you review? |
|
Workflow [PR], commit [f0799da] Summary: ✅
AI ReviewSummaryThis PR replaces the Final VerdictStatus: ✅ Approve No required actions. |
The new gtest's catch (...) around pool.wait() tripped the catch_all style check (silently swallows). pool.wait() rethrows only the blocker job's std::runtime_error, so catch the concrete std::exception instead; anything unexpected now propagates and fails the test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… construction throws Addresses review feedback on the broken-promise fix: CallbackRunnerTask's destructor constructed the DB::Exception as the argument to make_exception_ptr, outside any guard. Constructing a DB::Exception captures a stack trace and may itself throw (e.g. std::bad_alloc under shutdown memory pressure); the surrounding catch (...) then swallowed it and the std::promise destructed with no stored value, recreating the broken-promise std::future_error this class is meant to prevent. Build the exception_ptr inside its own try first and, if constructing the intended DB::Exception fails, fall back to std::current_exception() (always non-null inside a catch handler). The promise is then unconditionally satisfied via set_exception, so the dropped-task path can never leave it unset. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d-task path Addresses review feedback: the dropped-task path satisfied the promise while the callback was still a live member, only destroyed after the destructor body returned. set_exception can wake a waiter immediately, and a waiter that treats future.get() as "the async task is done" could then tear down owning/thread-group state that the callback's captures still reference. Mirror the run() path: release the callback (destroying its captures) under the task's ThreadGroupSwitcher and before satisfying the promise, via SCOPE_EXIT_SAFE. ThreadGroupSwitcher's constructor is noexcept and a no-op for a null or identical group, so it is safe on the shutdown drain thread (the dropped job is destroyed under ALLOW_ALLOCATIONS_IN_SCOPE in ThreadPool::worker). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The earlier Style fix narrowed catch(...) to typed catches but left the catch bodies empty, which the arm_tidy clang-tidy gate flags as bugprone-empty-catch (WERROR). Handle the expected exceptions instead of swallowing them: SUCCEED() on the blocker's rethrow, and assert the dropped task's DB::Exception carries CANNOT_SCHEDULE_TASK. This satisfies the lint and strengthens the test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 63/67 (94.03%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code |
In Coordination::ZooKeeper::sendThread, after a request is popped from the queue the local RequestInfo is the sole owner of its callback until the request is inserted into operations (from which receiveEvent()/finalize() later satisfy it). Async callers such as ZooKeeper::asyncTryExistsNoThrow capture a shared_ptr<std::promise<...>> in that callback and hand the std::future to a waiting caller before the request is processed. If anything in the pop->insert window throws (the OpenTelemetry span finalize, addRootPath, or the operations map insert, all of which can allocate and throw under memory pressure), the local RequestInfo is destroyed while unwinding and the callback never runs. The captured std::promise is then destroyed unsatisfied, so the waiter's future.get() observes a std::future_error (broken promise). That future_error is a std::logic_error, which the server reports as a LOGICAL_ERROR and aborts on (see PR ClickHouse#51907). This matches the reported crash (STID 2508-34fb, ZooKeeper-client variant): ~promise<ExistsResponse> running from sendThread's teardown. Arm a SCOPE_EXIT guard over that window: if the request is dropped before its callback ownership transfers to operations, satisfy the callback with an error response (ZCONNECTIONLOSS if it was probably sent, otherwise ZSESSIONEXPIRED) instead of abandoning the promise. This mirrors how finalize() drains outstanding operations and the already-merged sibling fixes for the same error class (TestKeeper, PR ClickHouse#73570; the thread-pool callback runner, PR ClickHouse#107383). The fix is at the send layer, so it covers every async request type uniformly. Add a unit test reproducing the window contract: without the guard the future carries a broken-promise future_error; with the guard it carries a normal ZSESSIONEXPIRED response and nothing aborts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a rare server abort during shutdown (
std::future_error, "The associated promise has been destructed prior to the associated state becoming ready") caused by a task scheduled throughthreadPoolCallbackRunnerUnsafebeing dropped from the thread pool queue before it ran.Description
threadPoolCallbackRunnerUnsafewrapped work in astd::packaged_taskand scheduled it on aThreadPool. When the pool is shut down (ThreadPool::finalizesetsshutdownand the worker drains the queue withjob_data.reset()), a still-queued task is destroyed without running. The barepackaged_taskthen destructs itsstd::promisewith no stored value, so~promise()stores a broken-promisestd::future_error.std::future_errorderives fromstd::logic_error, so when a waiter observes it viafuture.get()the server reports it throughgetCurrentExceptionMessageAndPattern, which callsabortOnFailedAssertionon anystd::logic_errorin debug/sanitizer builds, aborting the process during shutdown.Observed in CI as a
Stress test/serverfuzzabort (STID 2508-38c6): aTaskTrackerflush task scheduled throughthreadPoolCallbackRunnerUnsafe(asyncWriteBufferFromS3path) was dropped on shutdown, and the waiting merge task (MergePlainMergeTreeTask::executeStep) aborted.Report (arm_tsan): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105248&sha=c8f5c4e0660ad27d0509192338a1e38e4dcad792&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29
Fix: replace the
packaged_taskwith a small runnable that owns itsstd::promiseand, when destroyed without having run, satisfies the promise with a normalDB::Exception(CANNOT_SCHEDULE_TASK) instead of leaving it broken.DB::Exceptionis not astd::logic_error, so the waiter now gets an ordinary, catchable error and the server no longer aborts. This mirrors the explicit-promise handling already used by the siblingThreadPoolCallbackRunnerLocal(PR #73629) and preserves the existingThreadGroupSwitcher-before-set_valueordering and theSCOPE_EXIT_SAFEcallback release.Added a unit test (
gtest_thread_pool_callback_runner_shutdown) that deterministically drops queued runner tasks on shutdown and asserts their futures never carry a broken-promisestd::future_error.No related open issue found.
Version info
26.7.1.191