Fix null deref in RefreshTask::doScheduling under parallel shutdown#105588
Conversation
|
Workflow [PR], commit [6a3d451] Summary: ✅
AI ReviewSummaryThis PR fixes a rare refreshable materialized view shutdown race by returning early when Final VerdictStatus: ✅ Approve |
Address review feedback from clickhouse-gh[bot] on PR ClickHouse#105588: snapshotting `view->getContext()` and `view->getStorageID()` immediately before `lock.unlock()` is not enough for coordinated RMVs, because helpers called earlier in the same function already release the mutex internally. The three problem helpers all release `lock` briefly: - `updateCoordinationState` calls `zookeeper->tryMulti` (line 1261, only when `coordination.coordinated`). - `readZnodesIfNeeded` calls `zookeeper->tryGet` (line 1189, only when `coordination.coordinated`). - `updateDependenciesIfNeeded` calls `set_handle.getDependencies` and `set.findTasks` (line 1050) on every iteration of its retry loop. During any of those windows, a parallel `shutdown()` can acquire the mutex, finish its own `doScheduling(/*is_shutdown=*/true)`, take the second lock inside `shutdown()`, and set `view = nullptr` (RefreshTask.cpp:271). When the original caller re-acquires the lock, any subsequent `view->...` access is a null deref of the same class as STID 1805-36fc. `shutdown()` itself calls `doScheduling()` synchronously without holding the lock, so two concurrent `shutdown()` calls can also both pass the initial `view == nullptr` check and race the same way; one of them can also re-enter `doScheduling()` after the other already nulled `view`. Add `if (!view) return;` guards at every (re)lock site: - At the top of `doScheduling` (handles entry with view already nulled). - After `readZnodesIfNeeded` in `doScheduling`. - After `updateCoordinationState` in the `last_attempt_succeeded` branch (before my earlier snapshot of `view->getContext()`/`view->getStorageID()`). - After `updateDependenciesIfNeeded` in `doScheduling`. - After `lock.lock()` at the end of the unlock window inside `readZnodesIfNeeded`. - After `lock.lock()` at the end of the unlock window inside `updateDependenciesIfNeeded` (so the loop's next iteration does not redo `view->getContext()` on a freed `view`). The existing snapshot pattern at the inner unlock sites is preserved because the snapshot still needs to happen under the lock, before the last unlock, to use across `notifyDependents`. The existing `04273_concurrent_create_or_replace_refresh_mv.sh` regression test still passes (5/5 iterations); peer tests (`04227_create_or_replace_refresh_mv`, `03221_refreshable_matview_progress`, `03258_refreshable_mv_misc`, `03760_refreshable_mv_local`, `02932_refreshable_materialized_views_1`, `02932_refreshable_materialized_views_2`) also continue to pass. The Atomic-database test does not directly exercise the coordinated-RMV path the bot called out (`coordination.coordinated == true` gates the keeper unlock inside `updateCoordinationState`), but the post-unlock guard pattern protects both paths uniformly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Addressed @clickhouse-gh[bot]'s inline review in The bot correctly pointed out that snapshotting Fixup adds
The pre-unlock snapshots are kept because they still need to happen under the lock before the last Validation (debug build, locally):
Caveat: the Atomic-database regression test does not directly exercise the coordinated path the bot called out ( |
| zookeeper = view->getContext()->getZooKeeper(); | ||
| readZnodesIfNeeded(zookeeper, lock); | ||
| chassert(lock.owns_lock()); | ||
| if (!view) |
There was a problem hiding this comment.
@groeneai
If we checked view in line 561, why do we need to check again here?
There was a problem hiding this comment.
The second check is necessary because readZnodesIfNeeded releases and re-acquires the task mutex inside its body. Looking at readZnodesIfNeeded in this file:
- Line 1213:
lock.unlock()before the keepertryGet - Line 1233:
lock.lock()after responses come back - Line 1236-1237: internal
if (!view) return;guard for that first unlock window - Lines 1258-1260: a SECOND
lock.unlock()/lock.lock()aroundnotifyDependents, with no further view check before the function returns
So when control returns to doScheduling at line 581, two things can have happened:
viewwas nulled during 1213-1233 and the function bailed at line 1237 — the caller has no signal (the function returns normally either way).viewwas still set at line 1237 but was nulled during 1258-1260 —viewis null on return, no internal check caught it.
The next dereferences after line 583 — view->getContext()->getRefreshSet().refreshesStopped() at line 759 and view->getContext() / view->getStorageID() at lines 643-644 — would null-deref in either case. The check at line 582 is the bail-out at the relock site, matching the pattern documented at lines 556-560 ("Bail out at every (re)lock site rather than dereferencing a stale view.") and used symmetrically at lines 635-636 (after updateCoordinationState) and lines 756-757 (after updateDependenciesIfNeeded).
If you'd prefer a less repetitive form, I can hoist the bail-out into a helper (e.g. bool checkViewAlive(lock) or have the helpers return a bool to signal "view became null, please bail"), but I went with explicit checks at each relock site for visibility — the invariant is documented in one place (the comment block at 556-560) and the check is small enough to be obvious at each call site.
There was a problem hiding this comment.
OK, better add a comment to clarify that.
There was a problem hiding this comment.
Added in 88168181b715 (3-line comment matching the style of the existing rationale comments at the updateCoordinationState and updateDependenciesIfNeeded bail-outs further down in the same function).
|
@groeneai Check the CI failure and unresolved comments. |
|
@tuanpach addressed the inline view-check question above. CI attribution on
The only PR-actionable failure is the flaky-check timeout, which the new commit addresses. The fresh CI run should now show all of |
|
@groeneai Check the CI failure. |
|
@tuanpach Two items addressed:
The new commit only adds a comment, so the bugfix-validation framework limitation will persist on the next CI run too. Happy to retrigger the AST fuzzer once #100770 lands, or to push a |
|
Resolve the conflict |
8816818 to
475eee6
Compare
|
Rebased onto current master, conflict resolved. Verified locally: |
Pre-PR validation gate(Re-validation after rebasing onto current master to resolve the merge conflict;
Session id: cron:clickhouse-worker-slot-18:20260610-184200 |
475eee6 to
f830d43
Compare
|
@tuanpach Conflict resolved. The branch had drifted far behind master (the "Refreshable MV dependencies overhaul" refactored The refactor moved the racy post-unlock
Net diff is now just Validated locally on a debug build: 311 concurrent Session: cron:clickhouse-worker-slot-19:20260610-191000 |
|
Pushed The previous single-commit rebase landed the file as mode 100644, so the stateless runner could not exec it and reported Verified locally with a master debug build: the test runs in ~3-4s per iteration and prints The "Server died" results on the sanitizer flaky-check jobs are unrelated to this PR: the same artifact appears on ~12 other PRs over the last 3 days (generic |
Pre-PR validation gateTest-only change on
Validated locally on a debug build (26.6.1.1, head b7d7581) with a memory-capped server. Session id: cron:clickhouse-worker-slot-0:20260610-220700 |
|
CI finished on Remaining red checks are pre-existing trunk/infra, unrelated to this change:
No failure on |
|
Removed the Why: The C++ fix is unchanged. Per the bugfix-PR convention, crash fixes that can't be reproduced on the release binary are backed by the CI report link in the description (now added), which also satisfies the new-tests check. |
fca30e2 to
6be1d87
Compare
|
CI on The single red test job is Ready for review. |
|
CI finished on HEAD 6be1d87 (the rebased+squashed single commit = the RefreshTask.cpp fix only). Summary of the run and disposition of every non-green signal:
No PR-caused failures. Moving to review. |
|
@groeneai Check the failure in CI. Any PR fixing it? |
shutdown() runs doScheduling(is_shutdown=true) synchronously without holding
the task mutex (it deactivates the background scheduling_task first, then calls
doScheduling directly). Two concurrent shutdowns of the same refreshable MV can
therefore both reach doScheduling, and one can null `view` (shutdown() sets
view = nullptr after its final notifyDependents) while the other is still
running. The running doScheduling then dereferences a null `view` via
view->getContext()/view->getStorageID(), crashing with SIGSEGV (the weak_ptr
inside StorageMaterializedView at offset 0x140).
After the "Refreshable MV dependencies overhaul" refactor on master, the racy
post-unlock view dereference is centralized in notifyDependentsIfNeeded(), which
is reached from readZnodesIfNeeded() and updateCoordinationState() after they
release the mutex internally. Two guards close the window:
* doScheduling(): bail at entry if `view` is already null (a parallel
shutdown may have nulled it before this call started).
* notifyDependentsIfNeeded(): bail if `view` is null, and snapshot
getContext()/getStorageID() before unlocking so the unlocked
notifyDependents() call can't dereference a view that a parallel shutdown
nulls in the meantime. shutdown() does its own final notifyDependents(),
so skipping it here loses nothing.
This is the null-view SIGSEGV seen under stress with the thread fuzzer
(STID 1805-36fc). It does not address the separate
"Unexpected exception in refresh scheduling" LOGICAL_ERROR (ClickHouse#106737,
STID 2508-3e7b), which is the catch-all re-raising a thrown inner exception
from the dependency path and is tracked separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6be1d87 to
1149ae2
Compare
|
The only red on the previous head (
That run was 6 days old. I rebased the fix onto current master (no drift in |
|
CI finished on
This PR's C++ diff is confined to the |
|
Merge master to this branch |
|
@groeneai Hello? |
…h-task-shutdown-view-race-105548
|
@nikitamikhaylov Sorry for the delay. @tuanpach approved on 06-22 and in the same minute asked me to merge master into the branch; I just did that.
Should be ready to merge once the fresh CI on |
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered by tests: 12/14 (85.71%) | Lost baseline coverage: none · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix a rare server crash when refreshable materialized views are dropped or replaced concurrently. A null pointer was dereferenced in
RefreshTask::doSchedulingwhen another thread set the task'sviewtonullptrwhile the scheduling code had temporarily released its mutex.Description
CI report (the crash this fixes): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105548&sha=1c33e04b3075c798f03b5cac5856f04be52cfab4&name_0=PR&name_1=Stress%20test%20%28amd_debug%29
Fatal log: https://s3.amazonaws.com/clickhouse-test-reports/PRs/105548/1c33e04b3075c798f03b5cac5856f04be52cfab4/stress_test_amd_debug/fatal.log
SIGSEGV in
RefreshTask::doSchedulingon PR #105548 stress (STID1805-36fc),Address: 0x140(theWithContextImpl::contextweak_ptr insideStorageMaterializedView), i.e.viewwasnullptrat the deref.Root cause:
shutdown()clearsviewunder the mutex and then callsdoScheduling()synchronously.doScheduling()and its helpers (readZnodesIfNeeded,updateCoordinationState,collectDependencyStates,notifyDependentsIfNeeded) release the task mutex at several points to talk to keeper or to otherRefreshSettasks. A second concurrentshutdown()can win the race and nullviewduring one of those unlocked windows, after which the original caller dereferences a staleview(view->getContext()/view->getStorageID()).Fix: re-check
viewafter every relock site and snapshot the view accessors before releasing the mutex, matching the pattern already used bywait()andwaitForLatestTargetTable()("Can't useviewhere because shutdown() may unset it in parallel with us."). The execution path (executeRefreshand below) is unaffected:shutdown()callsexecution_task->deactivate()before clearingview, so it never observes a null view.No regression test is included. This is a probabilistic shutdown race (one CIDB hit in 30 days, only under stress with the thread fuzzer), and
px != 0is a debug-only assertion that is compiled out of the release binary. Bugfix validation downloads the unfixed master HEAD release binary and runs a changed test once with no thread fuzzer, so a race reproducer cannot fail there: it was verified to always pass on that binary, so bugfix validation reports "Failed to reproduce the bug". The crash is documented by the CI report linked above.Version info
26.7.1.129