iframe-proxy

tuanpach · 2026-04-01T03:13:38Z

During server shutdown, ContextSharedPart destroys workload_entity_storage via unique_ptr::reset while a query pipeline may still be running PipelineExecutor::allocateCPU, which holds a raw reference to the storage obtained from getWorkloadEntityStorage. This caused a use-after-free of the recursive_mutex inside WorkloadEntityStorageBase, detected by TSan as a data race in TraceEvent<EventUnlock> / TraceMutexUnlock.

Fix:

Change workload_entity_storage in ContextSharedPart from unique_ptr to shared_ptr.
Add Context::getWorkloadEntityStoragePtr returning a shared_ptr copy.
In PipelineExecutor::allocateCPU, hold the shared_ptr for the duration of the call so the storage object cannot be destroyed underneath it.

Test: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101060&sha=34f9537c4f02bdf3a862b3cdcad591081052235c&name_0=PR&name_1=Stress+test+%28amd_debug%29
#101060

Job log: https://pastila.clickhouse.com/?0000168c/a152326ece24275e5b4fe9dd6ab13982#ZZO9TJyd92smEO3/A1NiNQ==GCM

Failure: https://pastila.clickhouse.com/?001fdf85/ffb73f64f70fed2915e833076e3ffa75#JCRYrJgw6/Sz2vqG4S5AUw==GCM

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

During server shutdown, `ContextSharedPart` destroys `workload_entity_storage` via `unique_ptr::reset` while a query pipeline may still be running `PipelineExecutor::allocateCPU`, which holds a raw reference to the storage obtained from `getWorkloadEntityStorage`. This caused a use-after-free of the `recursive_mutex` inside `WorkloadEntityStorageBase`, detected by TSan as a data race in `TraceEvent<EventUnlock>` / `TraceMutexUnlock`. Fix: - Change `workload_entity_storage` in `ContextSharedPart` from `unique_ptr` to `shared_ptr`. - Add `Context::getWorkloadEntityStoragePtr` returning a `shared_ptr` copy. - In `PipelineExecutor::allocateCPU`, hold the `shared_ptr` for the duration of the call so the storage object cannot be destroyed underneath it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

clickhouse-gh · 2026-04-01T03:14:21Z

`getWorkloadEntityStoragePtr` returned `shared->workload_entity_storage` without holding `shared->mutex`, while `shutdown` moved from that same field under the mutex at the same time. Reading and writing one `std::shared_ptr` object concurrently from two threads is a data race. Fix: acquire `shared->mutex` (shared/read lock) before copying the `shared_ptr` out, matching the exclusive lock held by the write side in `shutdown`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The previous fix only guarded the read of `shared->workload_entity_storage` but the write inside `callOnce` was still unprotected. If `shutdown` ran concurrently with the first call to `getWorkloadEntityStoragePtr`, the `callOnce` lambda could write to `workload_entity_storage` at the same time as `shutdown` moved from it under `shared->mutex` — another data race on the same `std::shared_ptr` object. Fix: create the storage object in a local variable first (outside the mutex, since construction can be slow), then assign to `shared->workload_entity_storage` while holding `shared->mutex`. Both read and write of the field are now guarded by `shared->mutex`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

alexey-milovidov · 2026-04-07T00:28:52Z

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

alexey-milovidov · 2026-04-07T19:47:10Z

The failures of "Flaky check" in "functions_bad_arguments" will be fixed by #101994.

alexey-milovidov · 2026-04-08T02:58:22Z

@groeneai, make the test 02015_async_inserts_4 repeatable, so it runs well inside the flaky check. Send a new PR with the fix, link it here.

The test creates users and roles with hardcoded global names (u_02015_allowed, u_02015_denied, role_02015). When the flaky check runs this test multiple times with parallel workers, one instance's DROP USER races with another instance's REVOKE/CREATE, causing UNKNOWN_ROLE errors. Fix: suffix user and role names with $CLICKHOUSE_DATABASE (unique per test invocation) so parallel instances operate on independent global objects. Reproduction: --test-runs 10 -j 4 fails 6/10 without fix, 50/50 passes with fix at -j 8. Ref: ClickHouse#101447

groeneai · 2026-04-08T03:53:11Z

@alexey-milovidov — fix PR created: #102019

Root cause: The test uses hardcoded global user/role names (u_02015_allowed, u_02015_denied, role_02015). When the flaky check runs the test multiple times with parallel workers, one instance's DROP USER IF EXISTS races with another instance's REVOKE ALL / CREATE USER, causing UNKNOWN_ROLE errors.

Fix: Suffix all user/role names with $CLICKHOUSE_DATABASE (unique per test invocation) so parallel instances don't interfere.

Verified locally: 6/10 fail without fix (-j 4), 50/50 pass with fix (-j 8).

alexey-milovidov · 2026-04-08T21:05:41Z

The flaky test failures in this PR (01605_drop_settings_profile_while_assigned, 01418_custom_settings, 02015_async_inserts_4) are caused by non-unique access entity names and will be resolved by #102131.

Split flaky check and targeted check into separate roles: Flaky check — runs only new/modified tests from the PR with thread fuzzer enabled. Tests run in parallel with themselves (shared queue) to ensure new tests are parallel-safe under randomized thread ordering. Targeted check — runs previously failed and coverage-relevant tests with per-worker partitioning (--no-self-parallel) so the same test never runs concurrently on multiple workers, avoiding conflicts on shared resources (hardcoded users, roles, settings profiles, etc.). Both use --test-runs 50. Changes: - Add --no-self-parallel flag to clickhouse-test: partitions tests by name hash so all copies go to the same worker - Flaky check: only changed tests, thread fuzzer, shared queue - Targeted check: relevant tests, no thread fuzzer, per-worker queues, 50 runs, single clickhouse-test invocation - Restore original thread fuzzer parameters (reverts reductions from 06c945a and 9790435) - Add amd_binary to flaky check, use AMD_LARGE for targeted debug - Targeted check uses --flaky-check --no-self-parallel PRs affected by tests running in parallel with themselves: #102015 #101093 #99653 #101503 #101108 #101884 #100746 #101447 #100941 #101099 #100203 #96130 #102038

Split flaky check and targeted check into separate roles: Flaky check — runs only new/modified tests from the PR with thread fuzzer enabled. Tests can run in parallel with themselves (shared queue) to ensure new tests are parallel-safe. Targeted check — runs previously failed and coverage-relevant tests with per-worker partitioning (--no-self-parallel) so the same test never runs concurrently on multiple workers, avoiding conflicts on shared resources (hardcoded users, roles, settings profiles, etc.). Both use --test-runs 50. Changes: - Add --no-self-parallel flag to clickhouse-test: partitions tests by name hash so all copies go to the same worker - Flaky check: only changed tests, thread fuzzer, shared queue - Targeted check: relevant tests, no thread fuzzer, per-worker partitioning, 50 runs, single clickhouse-test invocation - Restore original thread fuzzer parameters - Change targeted check to arm_asan_ubsan on ARM_LARGE PRs affected by tests running in parallel with themselves: #102015 #101093 #99653 #101503 #101108 #101884 #100746 #101447 #100941 #101099 #100203 #96130

…e-lifetime-race

The `SHUTDOWN(... workload_entity_storage, stopWatching())` path read the `shared->workload_entity_storage` `shared_ptr` without holding `shared->mutex`, while `getWorkloadEntityStoragePtr` may concurrently initialize it under the same mutex. A concurrent read/write of the same `std::shared_ptr` object is a data race. Copy the pointer under a shared lock and call `stopWatching` on the local copy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`getWorkloadEntityStorage` returns a raw reference whose owning `shared_ptr` is discarded immediately, so callers running concurrently with shutdown could still observe a dangling reference once the storage is destroyed. Migrate the callers that run on query-execution paths (`ProcessList::insert`, the `CREATE`/`DROP WORKLOAD`/`RESOURCE` interpreters, `system.workloads`, `system.resources`, and the `DiskObjectStorage` subscription) to `getWorkloadEntityStoragePtr`, holding the `shared_ptr` for the duration of the access. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-03T23:35:11Z

Continued work on this PR:

Merged master — the branch was ~10931 commits behind (merge base from 2026-04-17). Clean merge, no conflicts; all original PR changes intact.
Addressed the two remaining clickhouse-gh review threads:
- Shutdown read race — the SHUTDOWN(... workload_entity_storage, stopWatching()) path read the shared_ptr without shared->mutex while getWorkloadEntityStoragePtr initializes it under the same mutex. Now copies the pointer under SharedLockGuard into a local and calls stopWatching on the copy (55d51c7aba0).
- Partial fix scoped only to PipelineExecutor::allocateCPU — migrated the remaining query-path callers of getWorkloadEntityStorage to getWorkloadEntityStoragePtr, holding the shared_ptr for the call duration: ProcessList::insert, the CREATE/DROP WORKLOAD/RESOURCE interpreters, system.workloads, system.resources, and the DiskObjectStorage subscription. The only remaining reference-accessor caller is createResourceManager (startup-time, long-lived reference member — a separate ownership question) (4cc5bf0027b).

Verification: built on ARM (clean, -Werror), and ran 03232_workload_create_and_drop, 03232_workloads_and_resources (ref-match) and 03232_resource_create_and_drop (matches reference modulo DB-name prefix) against a fresh binary — all pass, exercising the migrated interpreter and system.* paths.

The previously-reported CI failures (04057/04060 transaction tests and the Inconsistent AST formatting AST-fuzzer assertion) are from the stale pre-merge commit and are unrelated to workload entity storage; CI will re-evaluate them on the updated branch.

…e-lifetime-race

The previous fix protected `PipelineExecutor::allocateCPU` and the query-execution callers, but the lazy initialization of the resource manager still observed the same shutdown race. `Context::getWorkloadClassifier` initializes the resource manager lazily through `getResourceManager`, which calls `createResourceManager`. That passed `getWorkloadEntityStorage()` (a raw reference whose owning `shared_ptr` temporary is destroyed at the end of the full expression), so no owner was held across `WorkloadResourceManager`'s constructor while it runs `getAllEntitiesAndSubscribe`. If this first lazy init overlaps shutdown's reset of `workload_entity_storage`, the storage can be destroyed while the constructor is still using it — the same `recursive_mutex` use-after-free this PR set out to fix. Make `WorkloadResourceManager` take and retain a `shared_ptr<IWorkloadEntityStorage>`, and have `createResourceManager` pass `getWorkloadEntityStoragePtr`. The storage is only touched in the constructor, so holding the owner for the manager's lifetime is more than sufficient. Also remove the now-unused reference accessor `Context::getWorkloadEntityStorage` (the last caller in `Server.cpp`'s startup `loadEntities` is migrated to `getWorkloadEntityStoragePtr`), so the shutdown-unsafe accessor can no longer be reintroduced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e-lifetime-race

alexey-milovidov · 2026-06-07T22:29:02Z

Merged latest master (conflict-free) and addressed the remaining unresolved thread — the lazy createResourceManager shutdown lifetime race confirmed by @groeneai — in 1a59f3b.

WorkloadResourceManager now retains a std::shared_ptr<IWorkloadEntityStorage> and createResourceManager passes getWorkloadEntityStoragePtr, so an owner is held across construction even on the lazy init path that can overlap shutdown. The unsafe Context::getWorkloadEntityStorage reference accessor is removed entirely (last caller in Server.cpp migrated). Built clean; the 21 SchedulerWorkloadResourceManager gtests pass.

The only CI red is the Performance Comparison job (logical_functions_small, rand), where both ::old and ::new fail as "slower" — the standard noise pattern; fetch_perf_report.py reports "No significant performance changes detected", as expected for a shared_ptr lifetime change.

clickhouse-gh · 2026-06-08T02:08:12Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 104/117 (88.89%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 4 line(s) · Uncovered code

Full report · Diff report

serxa

LGTM, Thank you!

clickhouse-gh Bot added the pr-ci label Apr 1, 2026

clickhouse-gh Bot reviewed Apr 1, 2026

View reviewed changes

Comment thread src/Interpreters/Context.cpp

tuanpach and others added 2 commits April 2, 2026 00:04

Merge branch 'master' into fix-workload-storage-lifetime-race

32ed57c

clickhouse-gh Bot reviewed Apr 7, 2026

View reviewed changes

Comment thread src/Interpreters/Context.cpp Outdated

Merge remote-tracking branch 'origin/master' into tmp-101447

fd287e4

clickhouse-gh Bot reviewed Apr 8, 2026

View reviewed changes

Comment thread src/Interpreters/Context.cpp

groeneai mentioned this pull request Apr 8, 2026

Fix flaky test 02015_async_inserts_4 #102019

Merged

1 task

tuanpach and others added 4 commits April 17, 2026 07:39

Merge branch 'master' into fix-workload-storage-lifetime-race

70a6c1d

Merge remote-tracking branch 'origin/master' into fix-workload-storag…

3548119

…e-lifetime-race

alexey-milovidov and others added 3 commits June 6, 2026 11:49

Merge remote-tracking branch 'origin/master' into fix-workload-storag…

55b38ff

…e-lifetime-race

Merge remote-tracking branch 'origin/master' into fix-workload-storag…

6278ace

…e-lifetime-race

serxa self-assigned this Jun 9, 2026

serxa approved these changes Jun 9, 2026

View reviewed changes

alexey-milovidov added this pull request to the merge queue Jun 13, 2026

Merged via the queue into ClickHouse:master with commit a56ae9b Jun 13, 2026
166 checks passed

robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 13, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

tuanpach commented Apr 1, 2026 • edited by alexey-milovidov Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

clickhouse-gh Bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

alexey-milovidov commented Apr 7, 2026

Uh oh!

Uh oh!

alexey-milovidov commented Apr 7, 2026

Uh oh!

Uh oh!

alexey-milovidov commented Apr 8, 2026

Uh oh!

groeneai commented Apr 8, 2026

Uh oh!

alexey-milovidov commented Apr 8, 2026

Uh oh!

alexey-milovidov commented Jun 3, 2026

Uh oh!

alexey-milovidov commented Jun 7, 2026

Uh oh!

clickhouse-gh Bot commented Jun 8, 2026

LLVM Coverage Report

Uh oh!

serxa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tuanpach commented Apr 1, 2026 •

edited by alexey-milovidov

Loading

clickhouse-gh Bot commented Apr 1, 2026 •

edited

Loading