Fix spurious resource request failure from reused `ResourceGuard::Request` by alexey-milovidov · Pull Request #106690 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix spurious resource request failure from reused ResourceGuard::Request#106690

Merged
alexey-milovidov merged 8 commits into
masterfrom
fix-resourceguard-stale-exception
Jun 13, 2026
Merged

Fix spurious resource request failure from reused ResourceGuard::Request#106690
alexey-milovidov merged 8 commits into
masterfrom
fix-resourceguard-stale-exception

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 7, 2026

Copy link
Copy Markdown
Member

ResourceGuard::Request is reused per thread via the Request::local() thread-local instance. Its exception member is set by failed() but was never cleared on reuse, so a thread that had previously run a failed resource request kept a stale exception. A subsequent request scheduled on that thread could be granted via execute() (no exception set), yet Request::wait() would still observe the stale exception and throw RESOURCE_ACCESS_DENIED ("Resource request failed: ... Scheduler queue with resource request is about to be destructed"), even though that request had actually been granted.

This can surface to users as a query spuriously failing with RESOURCE_ACCESS_DENIED when restructuring/dropping a workload's queue (e.g. CREATE WORKLOAD/ALTER/DROP) fails one request and the same worker thread is then reused for another request that is in fact granted.

The fix clears exception in Request::enqueue() (the reuse reset point) so a granted request can never observe a previous request's failure. Genuinely failed requests still throw: failed() sets exception and wait() throws as before; the value is only reset when the request is reused for a new enqueue.

This was found while investigating a reliable deadlock in SchedulerWorkloadResourceManager.DropNotEmptyQueueLong exposed by the LIFO worker-thread wake order; it is a pre-existing latent bug that makes sense to fix on its own. Verified against SchedulerWorkloadResourceManager.*: passes under ThreadSanitizer and under heavy parallel stress (previously a reliable hang/spurious failure).

Related: #100177

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a rare spurious RESOURCE_ACCESS_DENIED error ("Scheduler queue with resource request is about to be destructed") for a query that was actually granted access to a workload resource, caused by a reused thread-local request object retaining a previous request's failure.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.6.1.760

…uest`

`ResourceGuard::Request` is reused per thread via the `Request::local()`
thread-local instance. Its `exception` member is set by `failed()` but was
never cleared on reuse, so a thread that had previously run a failed resource
request kept a stale `exception`. A subsequent request scheduled on that thread
could be granted via `execute()` (no exception set), yet `Request::wait()` would
still observe the stale `exception` and throw `RESOURCE_ACCESS_DENIED`
("Resource request failed: ... Scheduler queue with resource request is about to
be destructed"), even though that request had actually been granted.

This can surface to users as a query spuriously failing with
`RESOURCE_ACCESS_DENIED` when a workload's queue is restructured/dropped (e.g.
`CREATE WORKLOAD`/`ALTER`/`DROP`) fails one request and the same worker thread
is then reused for another request that is in fact granted.

Clear `exception` in `Request::enqueue()` (the reuse reset point) so a granted
request can never observe a previous request's failure. Genuinely failed
requests still throw: `failed()` sets `exception` and `wait()` throws as before;
the value is only reset when the request is reused for a new enqueue.

Related: #100177

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 7, 2026
Comment thread src/Common/Scheduler/ResourceGuard.h
alexey-milovidov and others added 2 commits June 8, 2026 23:00
…d regression tests

The default-lock `ResourceGuard` constructor enqueues the request and then
calls `Request::wait()` inside the constructor. If either `enqueue()` or
`wait()` throws (the request failed because its queue is being destructed),
the constructor propagates the exception and `~ResourceGuard()` never runs,
so the request is never finished. The reused thread-local `Request::local()`
instance is then left in the `Enqueued` or `Dequeued` state, and the next
request enqueued on that thread hits `chassert(state == Finished)` in
`Request::enqueue()` (an exception in debug/sanitizer builds), even though
this pull request is specifically making that same-thread reuse valid.

Wrap the constructor's `enqueue`/`wait` in a `try`/`catch` and restore the
reused request to the `Finished` state before rethrowing, via the new
`Request::recoverAfterConstructorFailure`. For a request failed in `wait()`
(state `Dequeued`) this mirrors what `~ResourceGuard()` does on the
`Lock::Defer` path: `finish` is a documented no-op for a failed request. For
a request that failed in `enqueue()` (state `Enqueued`) the request never
entered the scheduler, so only the state is reset.

Add two regression tests in `gtest_workload_resource_manager`:
- `ReuseRequestAfterFailedDeferRequestIsGranted` covers the original stale
  `exception` reset: a `Lock::Defer` request fails (queue destroyed) and a
  subsequent granted request on the same thread must not observe the previous
  failure. Without clearing `exception` in `enqueue()`, the granted request
  spuriously throws `RESOURCE_ACCESS_DENIED`.
- `ReuseRequestAfterFailedDefaultRequestIsGranted` covers the constructor
  failure path above: a `Lock::Default` request fails and a granted request
  is then issued on the same thread.

An uncaught exception in a worker thread is swallowed by the thread pool, so
the granted reuse is wrapped in `EXPECT_NO_THROW` to surface failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Common/Scheduler/ResourceGuard.h
@clickhouse-gh

clickhouse-gh Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR only fixes an exception-recovery bug in the workload I/O scheduler's ResourceGuard (clearing a stale error on a reused request) plus adds two regression tests; it does not touch the normal query-execution path. The two flagged improvements on tpch_adapted_1_official (Q7 -32%, Q20 -44%) therefore cannot plausibly be caused by this change, and the source side had few runs with noisy measurements (Q20 especially). Both have been downgraded to not-sure as likely run-to-run variance rather than a real PR effect.

clickbench

🟢 No significant changes

tpch_adapted_1_official

⚠️ 2 inconclusive

Flagged queries (2 of 22)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
⚠️ 7 not_sure 112 76 -32.1% <0.0001 PR only touches the I/O scheduler ResourceGuard error-recovery path; cannot speed up a TPC-H join. Off-path noise.
⚠️ 20 not_sure 241 135 -44.0% <0.0001 Same off-path change; source runs noisy (cv 36%) on only 6 runs. -44% is run-to-run variance, not this PR.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

Debug info
  • StressHouse run: 41f7e294-9ad8-44f5-a7aa-3f44ca378e84
  • MIRAI run: 9f9e9087-3ed7-427f-a635-f6a6d9078c4e
  • PR check IDs:
    • clickbench_67596_1780971423
    • clickbench_67694_1780971521
    • clickbench_67803_1780971558
    • tpch_adapted_1_official_67934_1780971781
    • tpch_adapted_1_official_68019_1780971811
    • tpch_adapted_1_official_68115_1780971845

alexey-milovidov and others added 2 commits June 10, 2026 13:40
…sourceTest.h`

The style check flags `ErrorCodes::INVALID_SCHEDULER_NODE` as defined but
not used in `ResourceTest.h`: the check excludes `tests/*.cpp` files but
not headers, so the declaration must live in
`gtest_workload_resource_manager.cpp` where the code is actually used.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Address review: `enqueueRequestUsingBudget` commits `budget.ask` before
calling the concrete `enqueueRequest`, which can throw before linking the
request into the queue (`INVALID_SCHEDULER_NODE` when the queue is being
destructed, or `SERVER_OVERLOADED` when `max_waiting_queries` is reached).
The request never enters the scheduler and `finish` never runs for it, but
the queue budget had already been changed assuming the request would be
granted `request->cost` and consume `estimated_cost`. In particular, when
the queue carries a debt from a previous over-consumption, `ask` raises the
request's cost to repay it, and a failed enqueue would silently forget the
debt, skewing costs of subsequent requests.

Wrap `enqueueRequest` in `try`/`catch` and roll the transaction back via
`budget.adjust(estimated_cost, request->cost)` before rethrowing. Add a
regression test `SchedulerRoot.BudgetAfterFailedEnqueue`: it creates a
negative budget, purges the queue so `enqueueRequest` throws, and checks
the budget is unchanged after the failed enqueue (without the rollback the
budget observably jumps from the debt value back to zero).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed two commits:

  • ef0dd5d733c fixes the style check failure: ErrorCodes::INVALID_SCHEDULER_NODE was declared in ResourceTest.h but used only in gtest_workload_resource_manager.cpp (the style check excludes tests/*.cpp but not headers), so the declaration moved to the test file.
  • e19c2c166bd addresses the review about budget exception safety: enqueueRequestUsingBudget now rolls back the budget.ask transaction if enqueueRequest throws before the request is linked, with a regression test SchedulerRoot.BudgetAfterFailedEnqueue (verified to fail without the rollback).

SchedulerRoot.* (4) and SchedulerWorkloadResourceManager.* (23) gtests pass locally; the cpp style check is clean.

Comment thread src/Common/Scheduler/ResourceGuard.h
alexey-milovidov and others added 2 commits June 13, 2026 02:53
`ResourceGuardSessionDataHooks` in `HTTPConnectionPool.cpp` is the second
owner of a `ResourceGuard::Request` (besides the thread-local instance used
by `ResourceGuard`). Its `atStart` does `request.enqueue` followed by
`request.wait`, and `HTTPSession::write`/`HTTPSession::receive` catch any
exception thrown from `atStart` (note that `DB::Exception` is a
`Poco::Exception`) and unconditionally call `atFail`.

If `enqueue` throws before the request is linked into the scheduler (for
example `INVALID_SCHEDULER_NODE` from a purged queue, or `SERVER_OVERLOADED`
from `max_waiting_queries`), the request is left in the `Enqueued` state.
The subsequent `atFail` then calls `request.finish(0, link)` on a
non-`Dequeued` request, tripping `chassert(state == Dequeued)` in
debug/sanitizer builds and adjusting the queue budget with a stale
`estimated_cost` in release builds.

Wrap `atStart` in a try/catch that calls `recoverAfterConstructorFailure`
on failure (mirroring the `ResourceGuard` constructor), and gate
`atFinish`/`atFail` behind an `active` flag set only after `wait` succeeds,
so the unconditional `atFail` during unwinding does not finish the request
twice.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

The sole CI red on the previous head (e19c2c166bd) was 01710_projection_aggregation_in_order in Stateless tests (arm_asan_ubsan, azure, parallel), which failed with Code: 159. DB::Exception: Timeout exceeded: elapsed 60032 ms, maximum: 60000 ms: While executing MergeTreeSelect(...) — a 60s timeout on the heavy ASan+UBSan+azure+parallel job. This is unrelated to the scheduler changes in this PR: the same test has timed out the same way on unrelated PRs over the last 30 days (#106734, #105246) and there is no open tracking issue. It is a slow-runner flake.

@groeneai, could you take a look at 01710_projection_aggregation_in_order timing out on arm_asan_ubsan, azure, parallel? Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=106690&sha=e19c2c166bdefa72b341a736015a546b292cf9fd&name_0=PR&name_1=Stateless%20tests%20%28arm_asan_ubsan%2C%20azure%2C%20parallel%29

// call `atFail()` while unwinding, so recover the request to the `Finished` state now and keep
// `active` false. Otherwise `atFail()` would call `request.finish()` on a non-`Dequeued` request,
// tripping `chassert(state == Dequeued)` in debug builds and corrupting the queue budget in release.
request.recoverAfterConstructorFailure(link);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recovery path needs a focused regression test for the ResourceGuardSessionDataHooks owner as well. The new tests cover the thread-local ResourceGuard::Request::local() owner and the budget rollback, but they do not exercise the Poco control flow that makes this line necessary: atStart throws, HTTPSession::write/receive catches it, and then unconditionally calls atFail.

Without a test around this hook path, dropping either this recovery call or the active guard below would leave all current regressions green while reintroducing the debug chassert / release budget-drift failure for scheduled HTTP socket I/O. Please add the smallest test that drives an HTTP session with an attached read or write ResourceLink, makes the scheduler request fail during atStart (for example by purging/filling the queue), and verifies the hook unwinds cleanly and the request can be reused/finished correctly.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov Investigated. The recorded red for 01710_projection_aggregation_in_order on e19c2c166bd is not a 60s timeout, it is the runner-wide OvercommitTracker OOM on the arm_asan_ubsan, azure, parallel shard. Same host-level event we analyzed for 04033_tpc_ds_q04 on your #106182.

Evidence (CIDB + the result_pr.json for the run you linked):

  • The failing test log is Code: 241 ... (total) memory limit exceeded ... current RSS: 54.24 GiB, maximum: 53.69 GiB. OvercommitTracker decision: Query was selected to stop ... While executing ProjectionDataSink, on the trivial INSERT INTO agg SELECT 1, ... FROM numbers(100000). It ran 1.46s before the host-cap kill, and the reproducibility check passed 0 out of 3 reruns.
  • There is no Code: 159 / Timeout exceeded row for this test in 90 days, and none for any test at e19c2c166bd. The 60032 ms MergeTreeSelect string is not in CIDB for this test or this PR.
  • 01710 is fast on this shard: 5691 OK runs, p50 1.8s, p95 4.6s, max 36.7s, never near the 60s budget. It is not a slow runner.
  • On 2026-06-10 (your run) 304 distinct tests on this shard were OvercommitTracker-killed at the identical 53.69 GiB cap (559 on 06-09). 01710 is one collateral victim, picked because OvercommitTracker had to free memory while other parallel tests held RSS at the host cap.

The contention self-resolved: shard OOM-kills dropped 304 (06-10) -> 59 (06-11) -> 12 (06-12), and 01710 has had zero failures since 06-10. This tracks the revert #107122 of the multistage-distributed-queries change (#106020) that drove the heap-corruption and thread-teardown storm on the asan shards from 06-08 to 06-10.

So no per-test fix is warranted. A long tag would not help (the kill is at 1.46s, not a duration issue), and pinning a setting or shrinking the insert would only change which query OvercommitTracker selects, not prevent the shard OOM. The only real fix is infra-level (lower per-shard test parallelism or raise the server memory budget on this runner), the same conclusion as #106182. If you are seeing a genuine Code: 159 60s MergeTreeSelect timeout on a head I am not looking at, point me at it and I will dig further.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=106690&sha=e19c2c166bdefa72b341a736015a546b292cf9fd&name_0=PR&name_1=Stateless%20tests%20%28arm_asan_ubsan%2C%20azure%2C%20parallel%29

@clickhouse-gh

clickhouse-gh Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.70% 84.70% +0.00%
Functions 92.30% 92.40% +0.10%
Branches 77.30% 77.30% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 120/127 (94.49%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 13, 2026
Merged via the queue into master with commit 97d86a3 Jun 13, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-resourceguard-stale-exception branch June 13, 2026 20:28
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants