iframe-proxy

groeneai · 2026-06-13T09:45:38Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a rare server abort during shutdown (std::future_error, "The associated promise has been destructed prior to the associated state becoming ready") caused by a task scheduled through threadPoolCallbackRunnerUnsafe being dropped from the thread pool queue before it ran.

Description

threadPoolCallbackRunnerUnsafe wrapped work in a std::packaged_task and scheduled it on a ThreadPool. When the pool is shut down (ThreadPool::finalize sets shutdown and the worker drains the queue with job_data.reset()), a still-queued task is destroyed without running. The bare packaged_task then destructs its std::promise with no stored value, so ~promise() stores a broken-promise std::future_error.

std::future_error derives from std::logic_error, so when a waiter observes it via future.get() the server reports it through getCurrentExceptionMessageAndPattern, which calls abortOnFailedAssertion on any std::logic_error in debug/sanitizer builds, aborting the process during shutdown.

Observed in CI as a Stress test / serverfuzz abort (STID 2508-38c6): a TaskTracker flush task scheduled through threadPoolCallbackRunnerUnsafe (async WriteBufferFromS3 path) was dropped on shutdown, and the waiting merge task (MergePlainMergeTreeTask::executeStep) aborted.

Report (arm_tsan): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105248&sha=c8f5c4e0660ad27d0509192338a1e38e4dcad792&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29

Fix: replace the packaged_task with a small runnable that owns its std::promise and, when destroyed without having run, satisfies the promise with a normal DB::Exception (CANNOT_SCHEDULE_TASK) instead of leaving it broken. DB::Exception is not a std::logic_error, so the waiter now gets an ordinary, catchable error and the server no longer aborts. This mirrors the explicit-promise handling already used by the sibling ThreadPoolCallbackRunnerLocal (PR #73629) and preserves the existing ThreadGroupSwitcher-before-set_value ordering and the SCOPE_EXIT_SAFE callback release.

Added a unit test (gtest_thread_pool_callback_runner_shutdown) that deterministically drops queued runner tasks on shutdown and asserts their futures never carry a broken-promise std::future_error.

No related open issue found.

Version info

Merged into: 26.7.1.191

…shutdown threadPoolCallbackRunnerUnsafe wrapped work in a std::packaged_task and scheduled it on a ThreadPool. When the pool is shut down (ThreadPool::finalize sets shutdown and the worker drains the queue with job_data.reset()), a still-queued task is destroyed without running. The bare packaged_task then destructs its std::promise with no stored value, so ~promise() stores a broken-promise std::future_error. std::future_error derives from std::logic_error, so when a waiter observes it via future.get() the server reports it through getCurrentExceptionMessageAndPattern, which calls abortOnFailedAssertion on any std::logic_error in debug/sanitizer builds, aborting the process during shutdown. Observed in CI as a Stress test / serverfuzz abort (STID 2508-38c6): a TaskTracker flush task scheduled through threadPoolCallbackRunnerUnsafe was dropped on shutdown, and the waiting merge task aborted with "The associated promise has been destructed prior to the associated state becoming ready". Replace the packaged_task with a small runnable that owns its std::promise and, when it is destroyed without having run, satisfies the promise with a normal DB::Exception (CANNOT_SCHEDULE_TASK) instead of leaving it broken. DB::Exception is not a std::logic_error, so the waiter now observes an ordinary, catchable error and the server no longer aborts. This mirrors the explicit-promise handling already used by the sibling ThreadPoolCallbackRunnerLocal (PR ClickHouse#73629) and preserves the existing ThreadGroupSwitcher-before-set ordering and the SCOPE_EXIT_SAFE callback release. Add a unit test that deterministically drops queued runner tasks on shutdown and asserts their futures never carry a broken-promise std::future_error. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-13T09:45:59Z

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. New unit test `ThreadPoolCallbackRunner.DroppedOnShutdownDoesNotBreakPromise` reproduces it on iteration 0 every run: a `max_threads=1` pool, a blocker task that throws (triggering `shutdown_on_exception`), and 32 victim tasks queued behind it via `threadPoolCallbackRunnerUnsafe`. On shutdown the worker drains the victims unrun, producing the exact CIDB signature "The associated promise has been destructed prior to the associated state becoming ready".
b	Root cause explained?	`threadPoolCallbackRunnerUnsafe` wraps work in a `std::packaged_task`. `ThreadPool::finalize()` sets `shutdown` and the worker drains the queue with `job_data.reset()` (ThreadPool.cpp), destroying still-queued tasks unrun. `~packaged_task` → `~promise()` stores a broken-promise `std::future_error`, which derives from `std::logic_error`; when a waiter observes it via `future.get()`, `getCurrentExceptionMessageAndPattern` calls `abortOnFailedAssertion` on any `std::logic_error` in debug/sanitizer builds → server abort.
c	Fix matches root cause?	Yes. The runner now uses a runnable that owns its `std::promise` and, when destroyed without running, satisfies the promise with a normal `DB::Exception(CANNOT_SCHEDULE_TASK)`. `DB::Exception` is not a `std::logic_error`, so the dropped-task future carries an ordinary catchable error rather than a broken promise — addressing the exact mechanism in (b), not a symptom.
d	Test intent preserved / new tests added?	New regression unit test added (`gtest_thread_pool_callback_runner_shutdown.cpp`). No existing test weakened. All 7 existing `ThreadPool.*` gtests still pass (incl. `ExceptionFromSchedule`, `ExceptionFromWait`).
e	Both directions demonstrated?	Yes. Without the fix (master header): test FAILS with the broken-promise `future_error`. With the fix: test PASSES (5/5 repeats). Build IDs differ between the two binaries (`ab722227…` → `df3c95a3…`), confirming the fix is in the tested binary.
f	Fix is general, not a narrow patch?	Yes. The fix lives in the runner itself and covers every `Result` type (`void` and non-void: `BatchAndHasNext`, `ObjectInfoPtr`, `ReaderHolder`, `MarkCache::MappedPtr`, …) and both drop paths (queue-drop on shutdown via the destructor, and schedule-time throw via the `catch`). The sibling `ThreadPoolCallbackRunnerLocal` was already given equivalent explicit-promise handling in PR #73629; this brings the `Unsafe` runner in line. 180 TUs recompiled with the changed header under `-Weverything -Werror` with 0 warnings.

Session id: cron:clickhouse-worker-slot-2:20260613-074600

groeneai · 2026-06-13T09:46:37Z

cc @Algunenano @al13n321 — could you review? threadPoolCallbackRunnerUnsafe scheduled work as a bare std::packaged_task; when the pool drops a still-queued task on shutdown (finalize → worker job_data.reset()), ~promise() stores a broken-promise std::future_error (a std::logic_error) that aborts the server via abortOnFailedAssertion when a waiter calls future.get() (CI STID 2508-38c6). The fix gives the runner an explicit promise satisfied with a normal DB::Exception on unrun-destruction, mirroring the ThreadPoolCallbackRunnerLocal handling from #73629.

clickhouse-gh · 2026-06-13T09:49:25Z

Workflow [PR], commit [f0799da]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR replaces the std::packaged_task used by threadPoolCallbackRunnerUnsafe with an explicit std::promise owner so tasks dropped from a shutting-down ThreadPool publish a normal DB::Exception instead of a broken-promise std::future_error. I rechecked the current PR-head code, the ThreadPool shutdown path, scheduling-failure behavior, existing threadPoolCallbackRunnerUnsafe result types, the new regression test, and the prior review threads. The earlier bot findings are addressed in the current code, and I found no remaining correctness issue that needs a new inline comment.

Final Verdict

Status: ✅ Approve

No required actions.

The new gtest's catch (...) around pool.wait() tripped the catch_all style check (silently swallows). pool.wait() rethrows only the blocker job's std::runtime_error, so catch the concrete std::exception instead; anything unexpected now propagates and fails the test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… construction throws Addresses review feedback on the broken-promise fix: CallbackRunnerTask's destructor constructed the DB::Exception as the argument to make_exception_ptr, outside any guard. Constructing a DB::Exception captures a stack trace and may itself throw (e.g. std::bad_alloc under shutdown memory pressure); the surrounding catch (...) then swallowed it and the std::promise destructed with no stored value, recreating the broken-promise std::future_error this class is meant to prevent. Build the exception_ptr inside its own try first and, if constructing the intended DB::Exception fails, fall back to std::current_exception() (always non-null inside a catch handler). The promise is then unconditionally satisfied via set_exception, so the dropped-task path can never leave it unset. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…d-task path Addresses review feedback: the dropped-task path satisfied the promise while the callback was still a live member, only destroyed after the destructor body returned. set_exception can wake a waiter immediately, and a waiter that treats future.get() as "the async task is done" could then tear down owning/thread-group state that the callback's captures still reference. Mirror the run() path: release the callback (destroying its captures) under the task's ThreadGroupSwitcher and before satisfying the promise, via SCOPE_EXIT_SAFE. ThreadGroupSwitcher's constructor is noexcept and a no-op for a null or identical group, so it is safe on the shutdown drain thread (the dropped job is destroyed under ALLOW_ALLOCATIONS_IN_SCOPE in ThreadPool::worker). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The earlier Style fix narrowed catch(...) to typed catches but left the catch bodies empty, which the arm_tidy clang-tidy gate flags as bugprone-empty-catch (WERROR). Handle the expected exceptions instead of swallowing them: SUCCEED() on the blocker's rethrow, and assert the dropped task's DB::Exception carries CANNOT_SCHEDULE_TASK. This satisfies the lint and strengthens the test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

clickhouse-gh · 2026-06-13T15:31:36Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.70%	84.70%	+0.00%
Functions	92.30%	92.30%	+0.00%
Branches	77.30%	77.30%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 63/67 (94.03%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code

Full report · Diff report

groeneai · 2026-06-13T16:55:55Z

In Coordination::ZooKeeper::sendThread, after a request is popped from the queue the local RequestInfo is the sole owner of its callback until the request is inserted into operations (from which receiveEvent()/finalize() later satisfy it). Async callers such as ZooKeeper::asyncTryExistsNoThrow capture a shared_ptr<std::promise<...>> in that callback and hand the std::future to a waiting caller before the request is processed. If anything in the pop->insert window throws (the OpenTelemetry span finalize, addRootPath, or the operations map insert, all of which can allocate and throw under memory pressure), the local RequestInfo is destroyed while unwinding and the callback never runs. The captured std::promise is then destroyed unsatisfied, so the waiter's future.get() observes a std::future_error (broken promise). That future_error is a std::logic_error, which the server reports as a LOGICAL_ERROR and aborts on (see PR ClickHouse#51907). This matches the reported crash (STID 2508-34fb, ZooKeeper-client variant): ~promise<ExistsResponse> running from sendThread's teardown. Arm a SCOPE_EXIT guard over that window: if the request is dropped before its callback ownership transfers to operations, satisfy the callback with an error response (ZCONNECTIONLOSS if it was probably sent, otherwise ZSESSIONEXPIRED) instead of abandoning the promise. This mirrors how finalize() drains outstanding operations and the already-merged sibling fixes for the same error class (TestKeeper, PR ClickHouse#73570; the thread-pool callback runner, PR ClickHouse#107383). The fix is at the send layer, so it covers every async request type uniformly. Add a unit test reproducing the window contract: without the guard the future carries a broken-promise future_error; with the guard it carries a normal ZSESSIONEXPIRED response and nothing aborts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 13, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 13, 2026

clickhouse-gh Bot reviewed Jun 13, 2026

View reviewed changes

Comment thread src/Common/threadPoolCallbackRunner.h Outdated

groeneai mentioned this pull request Jun 13, 2026

Fix arrayRemove logical error when comparing incompatible Variant types #105248

Merged

1 task

clickhouse-gh Bot reviewed Jun 13, 2026

View reviewed changes

Comment thread src/Common/threadPoolCallbackRunner.h Outdated

groeneai and others added 3 commits June 13, 2026 10:34

This was referenced Jun 14, 2026

Filesystem cache: encode segment size in the cache file name #107415

Open

Fix broken-promise server abort in ZooKeeper client sendThread #107647

Merged

Web UI: keep single query unselected on run #105711

Merged

This was referenced Jun 25, 2026

Fix Bad cast for a qualified asterisk over a JOIN USING key nested under a PASTE/CROSS join #108043

Merged

Allow KILL QUERY/MUTATION SELECT to return multiple blocks #104927

Open

Allowlist remote database engine connection errors in upgrade check #108560

Merged

alexey-milovidov approved these changes Jun 28, 2026

View reviewed changes

alexey-milovidov self-assigned this Jun 28, 2026

alexey-milovidov added this pull request to the merge queue Jun 28, 2026

Merged via the queue into ClickHouse:master with commit 5430376 Jun 28, 2026
166 checks passed

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 28, 2026

groeneai mentioned this pull request Jun 28, 2026

Fix broken_promise abort when scheduling the final S3 multipart upload task fails #108730

Merged

alexey-milovidov mentioned this pull request Jul 2, 2026

Fix "already attached to a group" abort when a callback-runner task is dropped synchronously #109111

Open

groeneai mentioned this pull request Jul 2, 2026

Support DateTime64 values before 1900 and after 2299 #107907

Open

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix broken-promise server abort in threadPoolCallbackRunnerUnsafe on shutdown#107383

Fix broken-promise server abort in threadPoolCallbackRunnerUnsafe on shutdown#107383
alexey-milovidov merged 5 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-2508-38c6-callbackrunner-broken-promise

groeneai commented Jun 13, 2026 •

edited by robot-clickhouse-ci-1

Loading

Uh oh!

groeneai commented Jun 13, 2026

Uh oh!

groeneai commented Jun 13, 2026

Uh oh!

clickhouse-gh Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented Jun 13, 2026

Uh oh!

groeneai commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented Jun 13, 2026 • edited by robot-clickhouse-ci-1 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Version info

Uh oh!

groeneai commented Jun 13, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 13, 2026

Uh oh!

clickhouse-gh Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented Jun 13, 2026

LLVM Coverage Report

Uh oh!

groeneai commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

groeneai commented Jun 13, 2026 •

edited by robot-clickhouse-ci-1

Loading

clickhouse-gh Bot commented Jun 13, 2026 •

edited

Loading