Fix null deref in RefreshTask::doScheduling under parallel shutdown by groeneai · Pull Request #105588 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix null deref in RefreshTask::doScheduling under parallel shutdown#105588

Merged
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-refresh-task-shutdown-view-race-105548
Jun 26, 2026
Merged

Fix null deref in RefreshTask::doScheduling under parallel shutdown#105588
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-refresh-task-shutdown-view-race-105548

Conversation

@groeneai

@groeneai groeneai commented May 21, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a rare server crash when refreshable materialized views are dropped or replaced concurrently. A null pointer was dereferenced in RefreshTask::doScheduling when another thread set the task's view to nullptr while the scheduling code had temporarily released its mutex.

Description

CI report (the crash this fixes): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105548&sha=1c33e04b3075c798f03b5cac5856f04be52cfab4&name_0=PR&name_1=Stress%20test%20%28amd_debug%29
Fatal log: https://s3.amazonaws.com/clickhouse-test-reports/PRs/105548/1c33e04b3075c798f03b5cac5856f04be52cfab4/stress_test_amd_debug/fatal.log

SIGSEGV in RefreshTask::doScheduling on PR #105548 stress (STID 1805-36fc), Address: 0x140 (the WithContextImpl::context weak_ptr inside StorageMaterializedView), i.e. view was nullptr at the deref.

Root cause: shutdown() clears view under the mutex and then calls doScheduling() synchronously. doScheduling() and its helpers (readZnodesIfNeeded, updateCoordinationState, collectDependencyStates, notifyDependentsIfNeeded) release the task mutex at several points to talk to keeper or to other RefreshSet tasks. A second concurrent shutdown() can win the race and null view during one of those unlocked windows, after which the original caller dereferences a stale view (view->getContext() / view->getStorageID()).

Fix: re-check view after every relock site and snapshot the view accessors before releasing the mutex, matching the pattern already used by wait() and waitForLatestTargetTable() ("Can't use view here because shutdown() may unset it in parallel with us."). The execution path (executeRefresh and below) is unaffected: shutdown() calls execution_task->deactivate() before clearing view, so it never observes a null view.

No regression test is included. This is a probabilistic shutdown race (one CIDB hit in 30 days, only under stress with the thread fuzzer), and px != 0 is a debug-only assertion that is compiled out of the release binary. Bugfix validation downloads the unfixed master HEAD release binary and runs a changed test once with no thread fuzzer, so a race reproducer cannot fail there: it was verified to always pass on that binary, so bugfix validation reports "Failed to reproduce the bug". The crash is documented by the CI report linked above.

Version info

  • Merged into: 26.7.1.129

@groeneai

Copy link
Copy Markdown
Contributor Author

@PedroTadim PedroTadim added the can be tested Allows running workflows for external contributors label May 22, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [6a3d451]

Summary:


AI Review

Summary

This PR fixes a rare refreshable materialized view shutdown race by returning early when view was already cleared and by snapshotting Context/StorageID before notifying dependents outside the task mutex. I rechecked the current diff against the prior review threads; the earlier concerns are either addressed, resolved as out of scope, or no longer present in the current code. I did not find a new blocker or major issue.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 22, 2026
Comment thread src/Storages/MaterializedView/RefreshTask.cpp Outdated
groeneai added a commit to groeneai/ClickHouse that referenced this pull request May 22, 2026
Address review feedback from clickhouse-gh[bot] on PR ClickHouse#105588: snapshotting
`view->getContext()` and `view->getStorageID()` immediately before
`lock.unlock()` is not enough for coordinated RMVs, because helpers called
earlier in the same function already release the mutex internally.

The three problem helpers all release `lock` briefly:

  - `updateCoordinationState` calls `zookeeper->tryMulti` (line 1261, only
    when `coordination.coordinated`).
  - `readZnodesIfNeeded` calls `zookeeper->tryGet` (line 1189, only when
    `coordination.coordinated`).
  - `updateDependenciesIfNeeded` calls `set_handle.getDependencies` and
    `set.findTasks` (line 1050) on every iteration of its retry loop.

During any of those windows, a parallel `shutdown()` can acquire the mutex,
finish its own `doScheduling(/*is_shutdown=*/true)`, take the second lock
inside `shutdown()`, and set `view = nullptr` (RefreshTask.cpp:271). When
the original caller re-acquires the lock, any subsequent `view->...`
access is a null deref of the same class as STID 1805-36fc.

`shutdown()` itself calls `doScheduling()` synchronously without holding
the lock, so two concurrent `shutdown()` calls can also both pass the
initial `view == nullptr` check and race the same way; one of them can
also re-enter `doScheduling()` after the other already nulled `view`.

Add `if (!view) return;` guards at every (re)lock site:

  - At the top of `doScheduling` (handles entry with view already nulled).
  - After `readZnodesIfNeeded` in `doScheduling`.
  - After `updateCoordinationState` in the `last_attempt_succeeded` branch
    (before my earlier snapshot of `view->getContext()`/`view->getStorageID()`).
  - After `updateDependenciesIfNeeded` in `doScheduling`.
  - After `lock.lock()` at the end of the unlock window inside
    `readZnodesIfNeeded`.
  - After `lock.lock()` at the end of the unlock window inside
    `updateDependenciesIfNeeded` (so the loop's next iteration does not
    redo `view->getContext()` on a freed `view`).

The existing snapshot pattern at the inner unlock sites is preserved
because the snapshot still needs to happen under the lock, before the
last unlock, to use across `notifyDependents`.

The existing `04273_concurrent_create_or_replace_refresh_mv.sh`
regression test still passes (5/5 iterations); peer tests
(`04227_create_or_replace_refresh_mv`, `03221_refreshable_matview_progress`,
`03258_refreshable_mv_misc`, `03760_refreshable_mv_local`,
`02932_refreshable_materialized_views_1`, `02932_refreshable_materialized_views_2`)
also continue to pass. The Atomic-database test does not directly exercise
the coordinated-RMV path the bot called out (`coordination.coordinated == true`
gates the keeper unlock inside `updateCoordinationState`), but the
post-unlock guard pattern protects both paths uniformly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Addressed @clickhouse-gh[bot]'s inline review in c4388625b98a.

The bot correctly pointed out that snapshotting view->getContext() and view->getStorageID() immediately before lock.unlock(); was not enough: updateCoordinationState, readZnodesIfNeeded, and updateDependenciesIfNeeded already release the mutex internally before my snapshot runs. In the coordinated path that internal unlock window is large enough for a parallel shutdown() to finish doScheduling(/*is_shutdown=*/true), take the second lock in shutdown(), and null view (RefreshTask.cpp:271) before the original thread relocks.

Fixup adds if (!view) return; guards at every (re)lock site:

  • Top of doScheduling (entry-with-null-view, possible when one shutdown finished completely before the second one entered doScheduling).
  • After readZnodesIfNeeded in doScheduling.
  • After updateCoordinationState in the last_attempt_succeeded branch, before the snapshot.
  • After updateDependenciesIfNeeded in doScheduling, before view->getContext()->getRefreshSet().refreshesStopped().
  • After lock.lock(); inside readZnodesIfNeeded.
  • After lock.lock(); inside updateDependenciesIfNeeded (so the loop's next iteration does not redo view->getContext() on a nulled view).

The pre-unlock snapshots are kept because they still need to happen under the lock before the last lock.unlock(); so the unlocked work uses a stable copy.

Validation (debug build, locally):

  • 04273_concurrent_create_or_replace_refresh_mv (the regression test added in the first commit): 5/5 OK.
  • 04227_create_or_replace_refresh_mv, 03221_refreshable_matview_progress, 03258_refreshable_mv_misc, 03760_refreshable_mv_local, 02932_refreshable_materialized_views_1, 02932_refreshable_materialized_views_2: all OK.

Caveat: the Atomic-database regression test does not directly exercise the coordinated path the bot called out (coordination.coordinated == true gates the keeper unlock inside updateCoordinationState). The post-unlock guard pattern protects the coordinated and uncoordinated paths uniformly; setting up Replicated database + replicated inner table in a stateless test for full coverage would be a separate effort.

cc @al13n321 @tuanpach

zookeeper = view->getContext()->getZooKeeper();
readZnodesIfNeeded(zookeeper, lock);
chassert(lock.owns_lock());
if (!view)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@groeneai
If we checked view in line 561, why do we need to check again here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second check is necessary because readZnodesIfNeeded releases and re-acquires the task mutex inside its body. Looking at readZnodesIfNeeded in this file:

  • Line 1213: lock.unlock() before the keeper tryGet
  • Line 1233: lock.lock() after responses come back
  • Line 1236-1237: internal if (!view) return; guard for that first unlock window
  • Lines 1258-1260: a SECOND lock.unlock() / lock.lock() around notifyDependents, with no further view check before the function returns

So when control returns to doScheduling at line 581, two things can have happened:

  1. view was nulled during 1213-1233 and the function bailed at line 1237 — the caller has no signal (the function returns normally either way).
  2. view was still set at line 1237 but was nulled during 1258-1260 — view is null on return, no internal check caught it.

The next dereferences after line 583 — view->getContext()->getRefreshSet().refreshesStopped() at line 759 and view->getContext() / view->getStorageID() at lines 643-644 — would null-deref in either case. The check at line 582 is the bail-out at the relock site, matching the pattern documented at lines 556-560 ("Bail out at every (re)lock site rather than dereferencing a stale view.") and used symmetrically at lines 635-636 (after updateCoordinationState) and lines 756-757 (after updateDependenciesIfNeeded).

If you'd prefer a less repetitive form, I can hoist the bail-out into a helper (e.g. bool checkViewAlive(lock) or have the helpers return a bool to signal "view became null, please bail"), but I went with explicit checks at each relock site for visibility — the invariant is documented in one place (the comment block at 556-560) and the check is small enough to be obvious at each call site.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, better add a comment to clarify that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 88168181b715 (3-line comment matching the style of the existing rationale comments at the updateCoordinationState and updateDependenciesIfNeeded bail-outs further down in the same function).

@tuanpach tuanpach self-assigned this May 25, 2026
@tuanpach

Copy link
Copy Markdown
Member

@groeneai Check the CI failure and unresolved comments.

@groeneai

Copy link
Copy Markdown
Contributor Author

@tuanpach addressed the inline view-check question above. CI attribution on c4388625b98a (pre-cc6c41f128ff):

Check Status Attribution
Bugfix validation (functional tests) FAIL Same root cause as the flaky-check timeout below — the regression test exceeds the 180s budget under ASan+UBSan. Addressed in cc6c41f128ff.
Stateless tests (amd_asan_ubsan, flaky check)04273_concurrent_create_or_replace_refresh_mv FAIL × 5 of 50 repeats PR-actionable. The test does 60 CREATE OR REPLACE MATERIALIZED VIEW … REFRESH … + SYSTEM WAIT VIEW iterations and approaches the 180s TEST_MAX_RUN_TIME_IN_SECONDS flaky-check ceiling under sanitizer + debug. Added Tags: long in cc6c41f128ff (matches 04102_hash_join_stress_test.sql); this opts out of the 180s budget (tests/clickhouse-test:3179-3183) and lowers the flaky-check repeat count via long_test_runs_ratio. The test's race window is unchanged.
Integration tests × 5 variants — test_mysql_protocol/test.py::test_mysql_dotnet_client FAIL Chronic infra flake, not PR-caused. 30-day CIDB: 96 distinct PRs, 94 master hits, 594 total events. Same Received unknown response X for SSLRequest shape across all five variants.
Stateless tests (arm_asan_ubsan, azure, sequential)Server died FAIL Chronic infra flake. 30-day CIDB: 814 PRs, 18 master hits, 2601 total. Generic ARM ASan azure-storage shape.
Finish WorkflowPost Hooks FAIL Chronic CIDB log-export infra flake (play.clickhouse.com:443).

The only PR-actionable failure is the flaky-check timeout, which the new commit addresses. The fresh CI run should now show all of 04273 clean on the flaky check and on the Bugfix validation budget, and the remaining red checks should be the chronic-flake set above which are not blocking.

@tuanpach

Copy link
Copy Markdown
Member

@groeneai Check the CI failure.

@groeneai

Copy link
Copy Markdown
Contributor Author

@tuanpach Two items addressed:

  1. Inline comment ask (line 582): added in 88168181b715 — three lines explaining that readZnodesIfNeeded releases the task mutex twice internally (around keeper->tryGet and around notifyDependents). Style matches the existing rationale comments at the updateCoordinationState and updateDependenciesIfNeeded bail-outs lower in the same function.

  2. CI on cc6c41f128ff — attribution unchanged from my reply yesterday; both remaining red checks are non-PR-actionable:

    Check Status Attribution
    AST fuzzer (amd_debug, targeted) FAIL Chronic trunkSTID 4697-4326 ("Column identifier ... is already registered") on a fuzzed ALTER TABLE … UPDATE … IN (SELECT … QUALIFY … INTERSECT ALL SELECT … QUALIFY …). 28+ failures across 21+ unrelated PRs in last 14 days. Fix in flight in Fix LOGICAL_ERROR "Column identifier is already registered" in UNION ALL #100770 (open, mergeable). Not related to RefreshTask.
    Bugfix validation (functional tests) FAIL Framework limitation, not a regression04273_concurrent_create_or_replace_refresh_mv ran clean on master HEAD ([ OK ] 2.25 sec on m8g.4xlarge ARM at 26251069c810…), so the bugfix-validation harness reports "Failed to reproduce the bug" since the test was expected to fail on master without the fix. The underlying bug is a non-deterministic shutdown race that requires ASan + many parallel CREATE OR REPLACE … REFRESH iterations to trigger reliably (STID 1805-36fc originally surfaced in a stress run, not a deterministic single-test execution).
    Mergeable Check / PR / Report messages FAIL Cascade from the two above; no separate failure.

The new commit only adds a comment, so the bugfix-validation framework limitation will persist on the next CI run too. Happy to retrigger the AST fuzzer once #100770 lands, or to push a Tags: long follow-up if you want to keep the bugfix-validation harness from re-running this test 60 iterations on every retry — let me know your preference.

@tuanpach

tuanpach commented Jun 9, 2026

Copy link
Copy Markdown
Member

@groeneai

Resolve the conflict

@groeneai groeneai force-pushed the groeneai/fix-refresh-task-shutdown-view-race-105548 branch from 8816818 to 475eee6 Compare June 10, 2026 20:00
@groeneai

Copy link
Copy Markdown
Contributor Author

Rebased onto current master, conflict resolved. RefreshTask.cpp was refactored upstream (the unlock/notify cycle moved into notifyDependentsIfNeeded, and updateDependenciesIfNeeded became collectDependencyStates/collectDependencyStatesUnlocked), so the view-null guards are ported onto the new structure: bail after each relock site and snapshot the view accessors before unlocking. The branch is now a single clean commit (475eee6).

Verified locally: 04273_concurrent_create_or_replace_refresh_mv passes (20/20 with randomization disabled) and the server stays up across 30+ concurrent CREATE OR REPLACE iterations; the rest of the refreshable-MV suite (02932_1/2/3, 03221, 03327, 03362, 03760, 04201) passes. PR is now MERGEABLE.

@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate

(Re-validation after rebasing onto current master to resolve the merge conflict; RefreshTask.cpp was refactored upstream, so the fix was ported onto the new helper structure.)

# Question Answer
a Deterministic repro? The crash is a non-deterministic shutdown race (STID 1805-36fc). It is forced by 04273_concurrent_create_or_replace_refresh_mv.sh: 6 parallel sessions running CREATE OR REPLACE MATERIALIZED VIEW ... REFRESH ... (each REPLACE drops the old MV, triggering shutdown()), with SYSTEM WAIT VIEW so each shutdown enters the last_attempt_succeeded branch where the original SIGSEGV occurred.
b Root cause explained? shutdown() clears view under the mutex, then calls doScheduling() synchronously. doScheduling() and its helpers (readZnodesIfNeeded, updateCoordinationState, collectDependencyStates, notifyDependentsIfNeeded) release the task mutex to talk to keeper / other RefreshSet tasks. A second concurrent shutdown() can null view during one of those unlocked windows; the original caller then dereferences a stale view (view->getContext()/view->getStorageID()) and crashes (read at offset 0x140, the WithContextImpl weak_ptr).
c Fix matches root cause? Yes. Re-check view after every relock site and snapshot the view accessors into locals before lock.unlock(), so the unlocked region never dereferences a possibly-null view. This is the exact pattern wait()/waitForLatestTargetTable() already document and use. Not a symptom guard: the checks sit precisely at the relock boundaries that the race exploits.
d Test intent preserved / new tests added? New regression test 04273_concurrent_create_or_replace_refresh_mv.sh added (unchanged from the previously-reviewed version, Tags: long retained). It exercises the concurrent-shutdown path directly.
e Both directions demonstrated? Pre-fix: this code path SIGSEGV'd in CI stress (STID 1805-36fc). Post-fix: 04273 passes 20/20 with randomization disabled and the server stays up across 30+ concurrent CREATE OR REPLACE iterations; the rest of the refreshable-MV suite (02932_1/2/3, 03221, 03327, 03362, 03760, 04201) passes.
f Fix is general, not a narrow patch? Audited every view deref reachable from doScheduling after a mutex release, including master's new helpers notifyDependentsIfNeeded and collectDependencyStatesUnlocked, and guarded/snapshotted each. The execution path (executeRefresh and below) is deliberately untouched and is safe by construction: shutdown() calls execution_task->deactivate() before clearing view, so it never observes a null view.

Session id: cron:clickhouse-worker-slot-18:20260610-184200

@groeneai groeneai force-pushed the groeneai/fix-refresh-task-shutdown-view-race-105548 branch from 475eee6 to f830d43 Compare June 10, 2026 20:44
@groeneai

Copy link
Copy Markdown
Contributor Author

@tuanpach Conflict resolved. The branch had drifted far behind master (the "Refreshable MV dependencies overhaul" refactored RefreshTask.cpp heavily), so I rebased onto current master and re-applied the fix on the new structure as a single clean commit (f830d43).

The refactor moved the racy post-unlock view dereference into notifyDependentsIfNeeded(), so the fix is now just two guards:

  • doScheduling(): bail at entry if view is already null (a parallel shutdown() may have nulled it before the call started).
  • notifyDependentsIfNeeded(): bail if view is null, and snapshot getContext()/getStorageID() before lock.unlock() so the unlocked notifyDependents() can't dereference a view that a parallel shutdown() nulls in the meantime. shutdown() does its own final notifyDependents(), so skipping it here loses nothing.

Net diff is now just RefreshTask.cpp (+14 lines) plus the regression test. PR is MERGEABLE again.

Validated locally on a debug build: 311 concurrent CREATE OR REPLACE MATERIALIZED VIEW ... REFRESH + DROP statements across 13 parallel sessions (each triggering RefreshTask::shutdown()), 0 crashes, 0 "Unexpected exception in refresh scheduling" in the server log.

Session: cron:clickhouse-worker-slot-19:20260610-191000

Comment thread tests/queries/0_stateless/04273_concurrent_create_or_replace_refresh_mv.sh Outdated
@groeneai

Copy link
Copy Markdown
Contributor Author

Pushed 5d0e14f: restore the executable bit (mode 100755) on 04273_concurrent_create_or_replace_refresh_mv.sh.

The previous single-commit rebase landed the file as mode 100644, so the stateless runner could not exec it and reported return code 126 (Permission denied) instead of running the test (arm_binary parallel; flaky/targeted jobs showed it as UNKNOWN). No content change, only the mode bit.

Verified locally with a master debug build: the test runs in ~3-4s per iteration and prints OK. The only iteration failures under randomization were a local tzdata artifact (session_timezone='zoneinfo/UTC' from get_localzone() on this host), which CIDB confirms does not occur in CI. The 180s flaky-check limit is moot here anyway: the test is tagged long, which exempts it from the per-run time check, and each run is well under the limit regardless.

The "Server died" results on the sanitizer flaky-check jobs are unrelated to this PR: the same artifact appears on ~12 other PRs over the last 3 days (generic TIMEOUT_EXCEEDED in MergeTreeSelect), with no RefreshTask assertion or px != 0 on this commit.

@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate

Test-only change on 04273_concurrent_create_or_replace_refresh_mv.sh addressing the @clickhouse-gh[bot] review (the loop's || true paths could mask a regression that broke every MV creation).

# Question Answer
a Deterministic repro? Yes. With the assertion broken (e.g. REFRESH EVERY 1 HOUR -> an invalid REFRESH clause), the script's stdout drops from 3\nOK to just OK on every run, so the test fails deterministically against the reference. The healthy path deterministically emits 3\nOK.
b Root cause explained? The test only asserted a final echo OK; every CREATE OR REPLACE / SYSTEM WAIT VIEW in the loop is 2>/dev/null || true to tolerate the expected concurrent-DDL conflicts. A regression that made all MV creation/refresh fail would leave the server up, reach echo OK, and pass -- the test could green without ever exercising the refreshable-MV path.
c Fix matches root cause? Yes. Added one non-optional CREATE MATERIALIZED VIEW ... REFRESH + SYSTEM WAIT VIEW + SELECT count() before the replacers start (outside the tolerated-race section). A count mismatch there now fails the test, proving at least one refreshable MV was created and refreshed. The in-loop || true conflicts stay tolerated, so the concurrent-shutdown coverage is unchanged.
d Test intent preserved / new tests added? Preserved and strengthened. The concurrent CREATE-OR-REPLACE shutdown race is untouched (same NUM_REPLACERS=6 x REPLACES_PER_THREAD=10, same || true); the new lines run before it and also give the first round of replacers a live, refreshing MV to contend with.
e Both directions demonstrated? Yes. Healthy build: clickhouse-test PASS ([ OK ], output 3\nOK), plus 6/6 raw-script runs producing exactly 3\nOK and the server staying up across all runs. Broken-MV simulation (invalid REFRESH clause): output becomes OK only -> reference mismatch -> test FAILS.
f Fix is general, not a narrow patch? N/A -- test robustness change, no source code modified. The underlying RefreshTask::doScheduling null-deref fix (this PR's other commit) is unchanged.

Validated locally on a debug build (26.6.1.1, head b7d7581) with a memory-capped server.

Session id: cron:clickhouse-worker-slot-0:20260610-220700

@groeneai

groeneai commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

CI finished on b7d7581f1ab4. No PR-caused failures: the regression test runs OK as a regular Stateless test (amd_debug, parallel and arm_binary, parallel both pass). The return code: 126 / Permission denied failures seen earlier were on f830d43 (the single-commit rebase dropped the file's executable bit) and were fixed in 5d0e14f by restoring mode 100755. CIDB confirms 04273 is mode 100755 on HEAD and the parallel runs pass.

Remaining red checks are pre-existing trunk/infra, unrelated to this change:

  • Bugfix validation (functional tests) / 04273: framework limitation. The shutdown race is non-deterministic and cannot be made to fail on a release master HEAD in a single-test run, so validation reports "Failed to reproduce". Same as on every prior commit of this PR.
  • Stateless tests (arm_asan_ubsan, azure, sequential): Server died + Segmentation fault (STID 0883-5001) + 02724_show_indexes. The crash stack is the ASan internal allocator (LargeMmapAllocator::Deallocate -> Quarantine::DoRecycle -> ___interceptor_free, no ClickHouse frames), i.e. the 2026-06-10/11 arm_asan_ubsan azure ASan-allocator-OOM outage (CIDB: hundreds of hits across 100+ PRs and master in the same window). Not refresh-related.
  • Stress test (amd_asan_ubsan) / Stress test (arm_tsan): Hung check / possible deadlock on shutdown, chronic stress families, separately tracked.
  • AST fuzzer (amd_msan) MSan + Unit tests (asan_ubsan, function_prop_fuzzer): chronic trunk families, separately tracked.

No failure on b7d7581 touches RefreshTask / doScheduling / refreshable MVs other than the bugfix-validation framework gap above. Ready for review.

@groeneai

Copy link
Copy Markdown
Contributor Author

Removed the 04273 stateless regression test (commit fca30e2) and added the original CI crash report link to the description instead.

Why: Bugfix validation (functional tests) was the only real blocker, and a race reproducer cannot pass it. That check downloads the unfixed master HEAD release binary and runs the changed test once, with no thread fuzzer and no-random-settings, expecting a crash. The RefreshTask::doScheduling null deref is a probabilistic shutdown race (one CIDB hit in 30 days, only under stress + thread fuzzer), and px != 0 is a debug-only assertion compiled out of the release binary. I verified against the same master binary (26.6.1.626) that the test runs clean every time, so the check inverts the pass to "Failed to reproduce the bug" and blocks the PR. There is no failpoint on master to force the interleaving.

The C++ fix is unchanged. Per the bugfix-PR convention, crash fixes that can't be reproduced on the release binary are backed by the CI report link in the description (now added), which also satisfies the new-tests check.

Comment thread src/Storages/MaterializedView/RefreshTask.cpp
@groeneai groeneai force-pushed the groeneai/fix-refresh-task-shutdown-view-race-105548 branch from fca30e2 to 6be1d87 Compare June 11, 2026 06:34
@groeneai

Copy link
Copy Markdown
Contributor Author

CI on 6be1d876e226 is finished. No PR-caused failures.

The single red test job is Stateless tests (amd_llvm_coverage, old analyzer, s3 storage, DatabaseReplicated, WasmEdge, parallel) (the Mergeable Check / PR reds are just its aggregator cascade). CIDB shows zero test-level failures for that job on this commit, which is the job-level infrastructure failure pattern for the heavy coverage variant (LLVM profile / timeout), not a test regression. This commit is message-only on top of the previous green content and only touches RefreshTask.cpp null guards, so it cannot cause a real coverage-test failure.

Ready for review.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on HEAD 6be1d87 (the rebased+squashed single commit = the RefreshTask.cpp fix only). Summary of the run and disposition of every non-green signal:

  • Functional test results on this HEAD: 578 passed, 0 failed, 195 skipped. The C++ fix is holding (no px != 0 assertion, no "Unexpected exception in refresh scheduling" LOGICAL_ERROR on any recent HEAD).
  • Bugfix validation (functional tests) / 04273 — RESOLVED. The regression test was removed from the diff (it cannot pass bugfix validation: that job runs the changed test once against the unfixed master release binary, where the debug-only px != 0 assertion is compiled out and the probabilistic shutdown race effectively never fires, so a PASS is always inverted to FAIL). Net branch diff is now the RefreshTask.cpp fix only; bugfix validation is correctly skipped (no functional test file changed) and the new-tests check is satisfied via the report link in the PR body.
  • Stateless tests (amd_llvm_coverage, old analyzer, ...) job marked failed — NOT a test failure. 578 tests passed; the only red is a job-level post-step glitch in "Collect logs": jeprof ... flamegraph.pl reported "No stack counts found" (empty jemalloc heap profile) and the container connection was canceled (grpc: the client connection is closing). This is CI infrastructure noise in log collection, unrelated to the change.

No PR-caused failures. Moving to review.

@tuanpach

Copy link
Copy Markdown
Member

@groeneai Check the failure in CI. Any PR fixing it?

shutdown() runs doScheduling(is_shutdown=true) synchronously without holding
the task mutex (it deactivates the background scheduling_task first, then calls
doScheduling directly). Two concurrent shutdowns of the same refreshable MV can
therefore both reach doScheduling, and one can null `view` (shutdown() sets
view = nullptr after its final notifyDependents) while the other is still
running. The running doScheduling then dereferences a null `view` via
view->getContext()/view->getStorageID(), crashing with SIGSEGV (the weak_ptr
inside StorageMaterializedView at offset 0x140).

After the "Refreshable MV dependencies overhaul" refactor on master, the racy
post-unlock view dereference is centralized in notifyDependentsIfNeeded(), which
is reached from readZnodesIfNeeded() and updateCoordinationState() after they
release the mutex internally. Two guards close the window:

  * doScheduling(): bail at entry if `view` is already null (a parallel
    shutdown may have nulled it before this call started).
  * notifyDependentsIfNeeded(): bail if `view` is null, and snapshot
    getContext()/getStorageID() before unlocking so the unlocked
    notifyDependents() call can't dereference a view that a parallel shutdown
    nulls in the meantime. shutdown() does its own final notifyDependents(),
    so skipping it here loses nothing.

This is the null-view SIGSEGV seen under stress with the thread fuzzer
(STID 1805-36fc). It does not address the separate
"Unexpected exception in refresh scheduling" LOGICAL_ERROR (ClickHouse#106737,
STID 2508-3e7b), which is the catch-all re-raising a thrown inner exception
from the dependency path and is tracked separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the groeneai/fix-refresh-task-shutdown-view-race-105548 branch from 6be1d87 to 1149ae2 Compare June 17, 2026 09:48
@groeneai

Copy link
Copy Markdown
Contributor Author

The only red on the previous head (6be1d876) was Stateless tests (amd_llvm_coverage, old analyzer, s3 storage, DatabaseReplicated, WasmEdge, parallel). It was not a test failure and not caused by this PR, so there is no fixing PR to point at:

  • CIDB records 0 test-level FAIL/ERROR on that commit (578 stateless tests passed, 0 failed). The structured report has 0 results recorded.
  • job.log shows the job crashed at infra level after "All tests have finished", during the post-test jemalloc heap-profiling step: error during connect: ... write: broken pipe -> Job killed, exit code [125]. The coverage runner's Docker daemon died.
  • The 3 transient in-run errors (01187_set_profile_as_setting, 03579_create_table_populate_from_s3, 04247_attach_table_storage_clauses_without_engine_error) each passed 0/3 on the reproducibility reruns and are unrelated to refreshable MVs.

That run was 6 days old. I rebased the fix onto current master (no drift in RefreshTask.cpp, diff unchanged at +14/-1) and force-pushed 1149ae22, so CI is running fresh now and the stale coverage-runner red should clear. PR is MERGEABLE.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on 1149ae22fb404ab9efa24c28e9648724d8fba430 (the fresh-master rebase). No PR-caused failures. The two red stress jobs are pre-existing trunk/infra failures unrelated to this null-deref fix:

  • Stress test (amd_tsan) (Logical error: 'Unexpected exception in refresh scheduling' STID 2508-3e7b, plus the Cannot start clickhouse-server / Check failed cascade from the same server abort): this is the chronic RefreshTask create()-initial-scheduling re-raise family, hitting 13+ unrelated PRs plus master in the last 30 days. It is a thrown-exception path (UNKNOWN_DATABASE from unqualified REFRESH DEPENDS ON), distinct from this PR's null-view SIGSEGV shutdown race. Tracked separately and fixed by Fix crash on REFRESH ... DEPENDS ON shadowed by a temporary table #107156 (Resolve REFRESH ... DEPENDS ON dependencies in the view's database).
  • Stress test (arm_msan) (Hung check failed, possible deadlock found): chronic msan stress-harness hang, not in the refresh path this PR touches.

This PR's C++ diff is confined to the doScheduling/notifyDependentsIfNeeded null guards in RefreshTask.cpp; neither stress failure originates there. Ready for review.

@tuanpach

Copy link
Copy Markdown
Member

@groeneai

Merge master to this branch

@nikitamikhaylov

Copy link
Copy Markdown
Member

@groeneai Hello?

@groeneai

Copy link
Copy Markdown
Contributor Author

@nikitamikhaylov Sorry for the delay. @tuanpach approved on 06-22 and in the same minute asked me to merge master into the branch; I just did that.

  • Merged origin/master into the branch (no conflicts). New head 6a3d451, MERGEABLE, fresh CI running. The approved commit 1149ae22 stays in history so the approval is preserved.
  • The C++ change is unchanged and confined to RefreshTask.cpp: a if (!view) return; guard at doScheduling entry plus a guard + snapshot-before-unlock in notifyDependentsIfNeeded. It fixes the null-view SIGSEGV during parallel shutdown() (STID 1805-36fc).
  • The only reds on the previous head were the Stress test (*) "Unexpected exception in refresh scheduling" family. That is a separate trunk bug: a thrown exception re-raised by the catch-all in the create() initial-scheduling lambda (RefreshTask.cpp:223 -> :936), not the shutdown null-deref this PR fixes (a SIGSEGV never reaches that catch). The UNKNOWN_DATABASE variant of it (STID 2508-3e7b) was already fixed by the merged Fix crash on REFRESH ... DEPENDS ON shadowed by a temporary table #107156; a residual variant is still on master and tracked separately, independent of this PR.

Should be ready to merge once the fresh CI on 6a3d451 settles.

@clickhouse-gh

clickhouse-gh Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.50% 92.50% +0.00%
Branches 77.60% 77.50% -0.10%

Changed lines: Changed C/C++ lines covered by tests: 12/14 (85.71%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants