Fix use-after-free of workload storage mutex during server shutdown by tuanpach · Pull Request #101447 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix use-after-free of workload storage mutex during server shutdown#101447

Merged
alexey-milovidov merged 12 commits into
ClickHouse:masterfrom
tuanpach:fix-workload-storage-lifetime-race
Jun 13, 2026
Merged

Fix use-after-free of workload storage mutex during server shutdown#101447
alexey-milovidov merged 12 commits into
ClickHouse:masterfrom
tuanpach:fix-workload-storage-lifetime-race

Conversation

@tuanpach

@tuanpach tuanpach commented Apr 1, 2026

Copy link
Copy Markdown
Member

During server shutdown, ContextSharedPart destroys workload_entity_storage via unique_ptr::reset while a query pipeline may still be running PipelineExecutor::allocateCPU, which holds a raw reference to the storage obtained from getWorkloadEntityStorage. This caused a use-after-free of the recursive_mutex inside WorkloadEntityStorageBase, detected by TSan as a data race in TraceEvent<EventUnlock> / TraceMutexUnlock.

Fix:

  • Change workload_entity_storage in ContextSharedPart from unique_ptr to shared_ptr.
  • Add Context::getWorkloadEntityStoragePtr returning a shared_ptr copy.
  • In PipelineExecutor::allocateCPU, hold the shared_ptr for the duration of the call so the storage object cannot be destroyed underneath it.

Test: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101060&sha=34f9537c4f02bdf3a862b3cdcad591081052235c&name_0=PR&name_1=Stress+test+%28amd_debug%29
#101060

Job log: https://pastila.clickhouse.com/?0000168c/a152326ece24275e5b4fe9dd6ab13982#ZZO9TJyd92smEO3/A1NiNQ==GCM

Failure: https://pastila.clickhouse.com/?001fdf85/ffb73f64f70fed2915e833076e3ffa75#JCRYrJgw6/Sz2vqG4S5AUw==GCM

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

During server shutdown, `ContextSharedPart` destroys `workload_entity_storage`
via `unique_ptr::reset` while a query pipeline may still be running
`PipelineExecutor::allocateCPU`, which holds a raw reference to the storage
obtained from `getWorkloadEntityStorage`. This caused a use-after-free of
the `recursive_mutex` inside `WorkloadEntityStorageBase`, detected by TSan
as a data race in `TraceEvent<EventUnlock>` / `TraceMutexUnlock`.

Fix:
- Change `workload_entity_storage` in `ContextSharedPart` from `unique_ptr`
  to `shared_ptr`.
- Add `Context::getWorkloadEntityStoragePtr` returning a `shared_ptr` copy.
- In `PipelineExecutor::allocateCPU`, hold the `shared_ptr` for the duration
  of the call so the storage object cannot be destroyed underneath it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Apr 1, 2026
Comment thread src/Interpreters/Context.cpp
tuanpach and others added 2 commits April 2, 2026 00:04
`getWorkloadEntityStoragePtr` returned `shared->workload_entity_storage`
without holding `shared->mutex`, while `shutdown` moved from that same
field under the mutex at the same time. Reading and writing one
`std::shared_ptr` object concurrently from two threads is a data race.

Fix: acquire `shared->mutex` (shared/read lock) before copying the
`shared_ptr` out, matching the exclusive lock held by the write side in
`shutdown`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous fix only guarded the read of `shared->workload_entity_storage`
but the write inside `callOnce` was still unprotected. If `shutdown` ran
concurrently with the first call to `getWorkloadEntityStoragePtr`, the
`callOnce` lambda could write to `workload_entity_storage` at the same
time as `shutdown` moved from it under `shared->mutex` — another data
race on the same `std::shared_ptr` object.

Fix: create the storage object in a local variable first (outside the
mutex, since construction can be slow), then assign to
`shared->workload_entity_storage` while holding `shared->mutex`.
Both read and write of the field are now guarded by `shared->mutex`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

Comment thread src/Interpreters/Context.cpp Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member

The failures of "Flaky check" in "functions_bad_arguments" will be fixed by #101994.

Comment thread src/Interpreters/Context.cpp
@alexey-milovidov

Copy link
Copy Markdown
Member

@groeneai, make the test 02015_async_inserts_4 repeatable, so it runs well inside the flaky check. Send a new PR with the fix, link it here.

groeneai added a commit to groeneai/ClickHouse that referenced this pull request Apr 8, 2026
The test creates users and roles with hardcoded global names
(u_02015_allowed, u_02015_denied, role_02015). When the flaky check
runs this test multiple times with parallel workers, one instance's
DROP USER races with another instance's REVOKE/CREATE, causing
UNKNOWN_ROLE errors.

Fix: suffix user and role names with $CLICKHOUSE_DATABASE (unique per
test invocation) so parallel instances operate on independent global
objects.

Reproduction: --test-runs 10 -j 4 fails 6/10 without fix, 50/50
passes with fix at -j 8.

Ref: ClickHouse#101447
@groeneai

groeneai commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

@alexey-milovidov — fix PR created: #102019

Root cause: The test uses hardcoded global user/role names (u_02015_allowed, u_02015_denied, role_02015). When the flaky check runs the test multiple times with parallel workers, one instance's DROP USER IF EXISTS races with another instance's REVOKE ALL / CREATE USER, causing UNKNOWN_ROLE errors.

Fix: Suffix all user/role names with $CLICKHOUSE_DATABASE (unique per test invocation) so parallel instances don't interfere.

Verified locally: 6/10 fail without fix (-j 4), 50/50 pass with fix (-j 8).

@alexey-milovidov

Copy link
Copy Markdown
Member

The flaky test failures in this PR (01605_drop_settings_profile_while_assigned, 01418_custom_settings, 02015_async_inserts_4) are caused by non-unique access entity names and will be resolved by #102131.

fm4v added a commit that referenced this pull request Apr 14, 2026
Split flaky check and targeted check into separate roles:

Flaky check — runs only new/modified tests from the PR with thread
fuzzer enabled. Tests run in parallel with themselves (shared queue)
to ensure new tests are parallel-safe under randomized thread ordering.

Targeted check — runs previously failed and coverage-relevant tests
with per-worker partitioning (--no-self-parallel) so the same test
never runs concurrently on multiple workers, avoiding conflicts on
shared resources (hardcoded users, roles, settings profiles, etc.).
Both use --test-runs 50.

Changes:
- Add --no-self-parallel flag to clickhouse-test: partitions tests
  by name hash so all copies go to the same worker
- Flaky check: only changed tests, thread fuzzer, shared queue
- Targeted check: relevant tests, no thread fuzzer, per-worker queues,
  50 runs, single clickhouse-test invocation
- Restore original thread fuzzer parameters (reverts reductions from
  06c945a and 9790435)
- Add amd_binary to flaky check, use AMD_LARGE for targeted debug
- Targeted check uses --flaky-check --no-self-parallel

PRs affected by tests running in parallel with themselves:
#102015 #101093 #99653 #101503 #101108 #101884 #100746
#101447 #100941 #101099 #100203 #96130

#102038
fm4v added a commit that referenced this pull request Apr 14, 2026
Split flaky check and targeted check into separate roles:

Flaky check — runs only new/modified tests from the PR with thread
fuzzer enabled. Tests can run in parallel with themselves (shared
queue) to ensure new tests are parallel-safe.

Targeted check — runs previously failed and coverage-relevant tests
with per-worker partitioning (--no-self-parallel) so the same test
never runs concurrently on multiple workers, avoiding conflicts on
shared resources (hardcoded users, roles, settings profiles, etc.).
Both use --test-runs 50.

Changes:
- Add --no-self-parallel flag to clickhouse-test: partitions tests
  by name hash so all copies go to the same worker
- Flaky check: only changed tests, thread fuzzer, shared queue
- Targeted check: relevant tests, no thread fuzzer, per-worker
  partitioning, 50 runs, single clickhouse-test invocation
- Restore original thread fuzzer parameters
- Change targeted check to arm_asan_ubsan on ARM_LARGE

PRs affected by tests running in parallel with themselves:
#102015 #101093 #99653 #101503 #101108 #101884 #100746
#101447 #100941 #101099 #100203 #96130
tuanpach and others added 4 commits April 17, 2026 07:39
The `SHUTDOWN(... workload_entity_storage, stopWatching())` path read the
`shared->workload_entity_storage` `shared_ptr` without holding `shared->mutex`,
while `getWorkloadEntityStoragePtr` may concurrently initialize it under the
same mutex. A concurrent read/write of the same `std::shared_ptr` object is a
data race. Copy the pointer under a shared lock and call `stopWatching` on the
local copy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`getWorkloadEntityStorage` returns a raw reference whose owning `shared_ptr` is
discarded immediately, so callers running concurrently with shutdown could still
observe a dangling reference once the storage is destroyed. Migrate the callers
that run on query-execution paths (`ProcessList::insert`, the
`CREATE`/`DROP WORKLOAD`/`RESOURCE` interpreters, `system.workloads`,
`system.resources`, and the `DiskObjectStorage` subscription) to
`getWorkloadEntityStoragePtr`, holding the `shared_ptr` for the duration of the
access.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Continued work on this PR:

  • Merged master — the branch was ~10931 commits behind (merge base from 2026-04-17). Clean merge, no conflicts; all original PR changes intact.
  • Addressed the two remaining clickhouse-gh review threads:
    • Shutdown read race — the SHUTDOWN(... workload_entity_storage, stopWatching()) path read the shared_ptr without shared->mutex while getWorkloadEntityStoragePtr initializes it under the same mutex. Now copies the pointer under SharedLockGuard into a local and calls stopWatching on the copy (55d51c7aba0).
    • Partial fix scoped only to PipelineExecutor::allocateCPU — migrated the remaining query-path callers of getWorkloadEntityStorage to getWorkloadEntityStoragePtr, holding the shared_ptr for the call duration: ProcessList::insert, the CREATE/DROP WORKLOAD/RESOURCE interpreters, system.workloads, system.resources, and the DiskObjectStorage subscription. The only remaining reference-accessor caller is createResourceManager (startup-time, long-lived reference member — a separate ownership question) (4cc5bf0027b).

Verification: built on ARM (clean, -Werror), and ran 03232_workload_create_and_drop, 03232_workloads_and_resources (ref-match) and 03232_resource_create_and_drop (matches reference modulo DB-name prefix) against a fresh binary — all pass, exercising the migrated interpreter and system.* paths.

The previously-reported CI failures (04057/04060 transaction tests and the Inconsistent AST formatting AST-fuzzer assertion) are from the stale pre-merge commit and are unrelated to workload entity storage; CI will re-evaluate them on the updated branch.

alexey-milovidov and others added 3 commits June 6, 2026 11:49
The previous fix protected `PipelineExecutor::allocateCPU` and the
query-execution callers, but the lazy initialization of the resource
manager still observed the same shutdown race.

`Context::getWorkloadClassifier` initializes the resource manager lazily
through `getResourceManager`, which calls `createResourceManager`. That
passed `getWorkloadEntityStorage()` (a raw reference whose owning
`shared_ptr` temporary is destroyed at the end of the full expression),
so no owner was held across `WorkloadResourceManager`'s constructor while
it runs `getAllEntitiesAndSubscribe`. If this first lazy init overlaps
shutdown's reset of `workload_entity_storage`, the storage can be
destroyed while the constructor is still using it — the same
`recursive_mutex` use-after-free this PR set out to fix.

Make `WorkloadResourceManager` take and retain a
`shared_ptr<IWorkloadEntityStorage>`, and have `createResourceManager`
pass `getWorkloadEntityStoragePtr`. The storage is only touched in the
constructor, so holding the owner for the manager's lifetime is more than
sufficient.

Also remove the now-unused reference accessor
`Context::getWorkloadEntityStorage` (the last caller in `Server.cpp`'s
startup `loadEntities` is migrated to `getWorkloadEntityStoragePtr`), so
the shutdown-unsafe accessor can no longer be reintroduced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged latest master (conflict-free) and addressed the remaining unresolved thread — the lazy createResourceManager shutdown lifetime race confirmed by @groeneai — in 1a59f3b.

WorkloadResourceManager now retains a std::shared_ptr<IWorkloadEntityStorage> and createResourceManager passes getWorkloadEntityStoragePtr, so an owner is held across construction even on the lazy init path that can overlap shutdown. The unsafe Context::getWorkloadEntityStorage reference accessor is removed entirely (last caller in Server.cpp migrated). Built clean; the 21 SchedulerWorkloadResourceManager gtests pass.

The only CI red is the Performance Comparison job (logical_functions_small, rand), where both ::old and ::new fail as "slower" — the standard noise pattern; fetch_perf_report.py reports "No significant performance changes detected", as expected for a shared_ptr lifetime change.

@clickhouse-gh

clickhouse-gh Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.50% 84.50% +0.00%
Functions 92.30% 92.30% +0.00%
Branches 77.20% 77.20% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 104/117 (88.89%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 4 line(s) · Uncovered code

Full report · Diff report

@serxa serxa self-assigned this Jun 9, 2026

@serxa serxa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thank you!

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 13, 2026
Merged via the queue into ClickHouse:master with commit a56ae9b Jun 13, 2026
166 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants