iframe-proxy

groeneai · 2026-05-21T20:50:39Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a rare server crash when refreshable materialized views are dropped or replaced concurrently. A null pointer was dereferenced in RefreshTask::doScheduling when another thread set the task's view to nullptr while the scheduling code had temporarily released its mutex.

Description

CI report (the crash this fixes): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105548&sha=1c33e04b3075c798f03b5cac5856f04be52cfab4&name_0=PR&name_1=Stress%20test%20%28amd_debug%29
Fatal log: https://s3.amazonaws.com/clickhouse-test-reports/PRs/105548/1c33e04b3075c798f03b5cac5856f04be52cfab4/stress_test_amd_debug/fatal.log

SIGSEGV in RefreshTask::doScheduling on PR #105548 stress (STID 1805-36fc), Address: 0x140 (the WithContextImpl::context weak_ptr inside StorageMaterializedView), i.e. view was nullptr at the deref.

Root cause: shutdown() clears view under the mutex and then calls doScheduling() synchronously. doScheduling() and its helpers (readZnodesIfNeeded, updateCoordinationState, collectDependencyStates, notifyDependentsIfNeeded) release the task mutex at several points to talk to keeper or to other RefreshSet tasks. A second concurrent shutdown() can win the race and null view during one of those unlocked windows, after which the original caller dereferences a stale view (view->getContext() / view->getStorageID()).

Fix: re-check view after every relock site and snapshot the view accessors before releasing the mutex, matching the pattern already used by wait() and waitForLatestTargetTable() ("Can't use view here because shutdown() may unset it in parallel with us."). The execution path (executeRefresh and below) is unaffected: shutdown() calls execution_task->deactivate() before clearing view, so it never observes a null view.

No regression test is included. This is a probabilistic shutdown race (one CIDB hit in 30 days, only under stress with the thread fuzzer), and px != 0 is a debug-only assertion that is compiled out of the release binary. Bugfix validation downloads the unfixed master HEAD release binary and runs a changed test once with no thread fuzzer, so a race reproducer cannot fail there: it was verified to always pass on that binary, so bugfix validation reports "Failed to reproduce the bug". The crash is documented by the CI report linked above.

Version info

Merged into: 26.7.1.129

groeneai · 2026-05-21T20:51:10Z

cc @al13n321 @tuanpach -- could you review? You are the comp-refreshable-materialized-view owners in component-tree.yaml.

Pre-PR validation gate

a) Deterministic repro?
Partial. Confirmed via code analysis -- the unsafe view->getContext() accesses at lines 626, 1058, 1219 in the original code are inside lock.unlock() ... lock.lock() windows where RefreshTask::shutdown() may have already cleared view = nullptr (RefreshTask.cpp:271). A locally added 100 ms sleep at line 625 plus 6 parallel CREATE OR REPLACE MATERIALIZED VIEW ... REFRESH ... sessions did not deterministically reproduce on a debug build (the second shutdown() would need to fully drain doScheduling(true) inside the widened window). The CI evidence is the single STID 1805-36fc segfault on PR #105548 (Stress test (amd_debug)), Address: 0x140 -- the byte offset of the WithContextImpl::context weak_ptr inside StorageMaterializedView, i.e. view was a null pointer when its getContext() was called.

b) Root cause explained?
RefreshTask::shutdown() is explicitly documented as safe to call from multiple threads in parallel. It ends by acquiring the mutex and setting view = nullptr (RefreshTask.cpp:271). Inside doScheduling, updateDependenciesIfNeeded, and readZnodesIfNeeded, three sites called view->getContext() / view->getStorageID() after lock.unlock(). A second concurrent shutdown() can grab the mutex during that window and clear view before the original thread re-locks, turning the next view->... into a null deref. The crash address 0x140 matches the layout offset.

c) Fix matches root cause?
Yes. Snapshot view->getContext() (and view->getStorageID() where needed) into locals under the mutex, then use those locals after lock.unlock(). ContextPtr keeps the Context alive across the unlocked region. This mirrors the existing pattern in RefreshTask::wait() (RefreshTask.cpp:489-494), which already has the comment "Can't use view here because shutdown() may unset it in parallel with us."

d) Test intent preserved / new test added?
Existing tests unchanged. Added tests/queries/0_stateless/04273_concurrent_create_or_replace_refresh_mv.sh that hammers CREATE OR REPLACE MATERIALIZED VIEW ... REFRESH ... from 6 parallel sessions with SYSTEM WAIT VIEW between iterations to force the last_attempt_succeeded branch where the crash was observed.

e) Demonstrated in both directions?
Partial. On the fixed build the new test passes 30+ iterations and existing refresh MV tests (04227_create_or_replace_refresh_mv, 03221_refreshable_matview_progress, 03258_refreshable_mv_misc, 03760_refreshable_mv_local) continue to pass. The "without fix" direction is the CI report on PR #105548 (linked in the description); local repro with a sleep widener at line 625 did not crash in a debug build within ~30 iterations, the actual CI race window came from stress-mode thread fuzzing.

f) Fix is general?
Yes. All three sites in RefreshTask.cpp where view->... was dereferenced inside an unlock()/lock() window have been converted to the snapshot-before-unlock pattern. The fourth such site (updateCoordinationState at line 1249) does not access view during the unlock, so it was left alone.

Session: cron:clickhouse-ci-task-worker:20260521-201500

clickhouse-gh · 2026-05-22T08:07:02Z

Workflow [PR], commit [6a3d451]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR fixes a rare refreshable materialized view shutdown race by returning early when view was already cleared and by snapshotting Context/StorageID before notifying dependents outside the task mutex. I rechecked the current diff against the prior review threads; the earlier concerns are either addressed, resolved as out of scope, or no longer present in the current code. I did not find a new blocker or major issue.

Final Verdict

Status: ✅ Approve

Address review feedback from clickhouse-gh[bot] on PR ClickHouse#105588: snapshotting `view->getContext()` and `view->getStorageID()` immediately before `lock.unlock()` is not enough for coordinated RMVs, because helpers called earlier in the same function already release the mutex internally. The three problem helpers all release `lock` briefly: - `updateCoordinationState` calls `zookeeper->tryMulti` (line 1261, only when `coordination.coordinated`). - `readZnodesIfNeeded` calls `zookeeper->tryGet` (line 1189, only when `coordination.coordinated`). - `updateDependenciesIfNeeded` calls `set_handle.getDependencies` and `set.findTasks` (line 1050) on every iteration of its retry loop. During any of those windows, a parallel `shutdown()` can acquire the mutex, finish its own `doScheduling(/*is_shutdown=*/true)`, take the second lock inside `shutdown()`, and set `view = nullptr` (RefreshTask.cpp:271). When the original caller re-acquires the lock, any subsequent `view->...` access is a null deref of the same class as STID 1805-36fc. `shutdown()` itself calls `doScheduling()` synchronously without holding the lock, so two concurrent `shutdown()` calls can also both pass the initial `view == nullptr` check and race the same way; one of them can also re-enter `doScheduling()` after the other already nulled `view`. Add `if (!view) return;` guards at every (re)lock site: - At the top of `doScheduling` (handles entry with view already nulled). - After `readZnodesIfNeeded` in `doScheduling`. - After `updateCoordinationState` in the `last_attempt_succeeded` branch (before my earlier snapshot of `view->getContext()`/`view->getStorageID()`). - After `updateDependenciesIfNeeded` in `doScheduling`. - After `lock.lock()` at the end of the unlock window inside `readZnodesIfNeeded`. - After `lock.lock()` at the end of the unlock window inside `updateDependenciesIfNeeded` (so the loop's next iteration does not redo `view->getContext()` on a freed `view`). The existing snapshot pattern at the inner unlock sites is preserved because the snapshot still needs to happen under the lock, before the last unlock, to use across `notifyDependents`. The existing `04273_concurrent_create_or_replace_refresh_mv.sh` regression test still passes (5/5 iterations); peer tests (`04227_create_or_replace_refresh_mv`, `03221_refreshable_matview_progress`, `03258_refreshable_mv_misc`, `03760_refreshable_mv_local`, `02932_refreshable_materialized_views_1`, `02932_refreshable_materialized_views_2`) also continue to pass. The Atomic-database test does not directly exercise the coordinated-RMV path the bot called out (`coordination.coordinated == true` gates the keeper unlock inside `updateCoordinationState`), but the post-unlock guard pattern protects both paths uniformly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

groeneai · 2026-05-22T08:43:23Z

Addressed @clickhouse-gh[bot]'s inline review in c4388625b98a.

The bot correctly pointed out that snapshotting view->getContext() and view->getStorageID() immediately before lock.unlock(); was not enough: updateCoordinationState, readZnodesIfNeeded, and updateDependenciesIfNeeded already release the mutex internally before my snapshot runs. In the coordinated path that internal unlock window is large enough for a parallel shutdown() to finish doScheduling(/*is_shutdown=*/true), take the second lock in shutdown(), and null view (RefreshTask.cpp:271) before the original thread relocks.

Fixup adds if (!view) return; guards at every (re)lock site:

Top of doScheduling (entry-with-null-view, possible when one shutdown finished completely before the second one entered doScheduling).
After readZnodesIfNeeded in doScheduling.
After updateCoordinationState in the last_attempt_succeeded branch, before the snapshot.
After updateDependenciesIfNeeded in doScheduling, before view->getContext()->getRefreshSet().refreshesStopped().
After lock.lock(); inside readZnodesIfNeeded.
After lock.lock(); inside updateDependenciesIfNeeded (so the loop's next iteration does not redo view->getContext() on a nulled view).

The pre-unlock snapshots are kept because they still need to happen under the lock before the last lock.unlock(); so the unlocked work uses a stable copy.

Validation (debug build, locally):

04273_concurrent_create_or_replace_refresh_mv (the regression test added in the first commit): 5/5 OK.
04227_create_or_replace_refresh_mv, 03221_refreshable_matview_progress, 03258_refreshable_mv_misc, 03760_refreshable_mv_local, 02932_refreshable_materialized_views_1, 02932_refreshable_materialized_views_2: all OK.

Caveat: the Atomic-database regression test does not directly exercise the coordinated path the bot called out (coordination.coordinated == true gates the keeper unlock inside updateCoordinationState). The post-unlock guard pattern protects the coordinated and uncoordinated paths uniformly; setting up Replicated database + replicated inner table in a stateless test for full coverage would be a separate effort.

cc @al13n321 @tuanpach

tuanpach · 2026-05-25T03:40:08Z

            zookeeper = view->getContext()->getZooKeeper();
        readZnodesIfNeeded(zookeeper, lock);
        chassert(lock.owns_lock());
+        if (!view)


@groeneai
If we checked view in line 561, why do we need to check again here?

The second check is necessary because readZnodesIfNeeded releases and re-acquires the task mutex inside its body. Looking at readZnodesIfNeeded in this file:

Line 1213: lock.unlock() before the keeper tryGet

Line 1233: lock.lock() after responses come back

Line 1236-1237: internal if (!view) return; guard for that first unlock window

Lines 1258-1260: a SECOND lock.unlock() / lock.lock() around notifyDependents, with no further view check before the function returns

So when control returns to doScheduling at line 581, two things can have happened:

view was nulled during 1213-1233 and the function bailed at line 1237 — the caller has no signal (the function returns normally either way).

view was still set at line 1237 but was nulled during 1258-1260 — view is null on return, no internal check caught it.

The next dereferences after line 583 — view->getContext()->getRefreshSet().refreshesStopped() at line 759 and view->getContext() / view->getStorageID() at lines 643-644 — would null-deref in either case. The check at line 582 is the bail-out at the relock site, matching the pattern documented at lines 556-560 ("Bail out at every (re)lock site rather than dereferencing a stale view.") and used symmetrically at lines 635-636 (after updateCoordinationState) and lines 756-757 (after updateDependenciesIfNeeded).

If you'd prefer a less repetitive form, I can hoist the bail-out into a helper (e.g. bool checkViewAlive(lock) or have the helpers return a bool to signal "view became null, please bail"), but I went with explicit checks at each relock site for visibility — the invariant is documented in one place (the comment block at 556-560) and the check is small enough to be obvious at each call site.

OK, better add a comment to clarify that.

Added in 88168181b715 (3-line comment matching the style of the existing rationale comments at the updateCoordinationState and updateDependenciesIfNeeded bail-outs further down in the same function).

tuanpach · 2026-05-25T03:40:32Z

@groeneai Check the CI failure and unresolved comments.

groeneai · 2026-05-25T04:09:41Z

@tuanpach addressed the inline view-check question above. CI attribution on c4388625b98a (pre-cc6c41f128ff):

Check	Status	Attribution
`Bugfix validation (functional tests)`	FAIL	Same root cause as the flaky-check timeout below — the regression test exceeds the 180s budget under ASan+UBSan. Addressed in `cc6c41f128ff`.
`Stateless tests (amd_asan_ubsan, flaky check)` — `04273_concurrent_create_or_replace_refresh_mv`	FAIL × 5 of 50 repeats	PR-actionable. The test does 60 `CREATE OR REPLACE MATERIALIZED VIEW … REFRESH … + SYSTEM WAIT VIEW` iterations and approaches the 180s `TEST_MAX_RUN_TIME_IN_SECONDS` flaky-check ceiling under sanitizer + debug. Added `Tags: long` in `cc6c41f128ff` (matches `04102_hash_join_stress_test.sql`); this opts out of the 180s budget (`tests/clickhouse-test:3179-3183`) and lowers the flaky-check repeat count via `long_test_runs_ratio`. The test's race window is unchanged.
`Integration tests` × 5 variants — `test_mysql_protocol/test.py::test_mysql_dotnet_client`	FAIL	Chronic infra flake, not PR-caused. 30-day CIDB: 96 distinct PRs, 94 master hits, 594 total events. Same `Received unknown response X for SSLRequest` shape across all five variants.
`Stateless tests (arm_asan_ubsan, azure, sequential)` — `Server died`	FAIL	Chronic infra flake. 30-day CIDB: 814 PRs, 18 master hits, 2601 total. Generic ARM ASan azure-storage shape.
`Finish Workflow` — `Post Hooks`	FAIL	Chronic CIDB log-export infra flake (`play.clickhouse.com:443`).

The only PR-actionable failure is the flaky-check timeout, which the new commit addresses. The fresh CI run should now show all of 04273 clean on the flaky check and on the Bugfix validation budget, and the remaining red checks should be the chronic-flake set above which are not blocking.

tuanpach · 2026-05-26T07:41:55Z

@groeneai Check the CI failure.

groeneai · 2026-05-26T10:06:18Z

@tuanpach Two items addressed:

Inline comment ask (line 582): added in 88168181b715 — three lines explaining that readZnodesIfNeeded releases the task mutex twice internally (around keeper->tryGet and around notifyDependents). Style matches the existing rationale comments at the updateCoordinationState and updateDependenciesIfNeeded bail-outs lower in the same function.

CI on cc6c41f128ff — attribution unchanged from my reply yesterday; both remaining red checks are non-PR-actionable:

Check	Status	Attribution
`AST fuzzer (amd_debug, targeted)`	FAIL	Chronic trunk — `STID 4697-4326` ("Column identifier ... is already registered") on a fuzzed `ALTER TABLE … UPDATE … IN (SELECT … QUALIFY … INTERSECT ALL SELECT … QUALIFY …)`. 28+ failures across 21+ unrelated PRs in last 14 days. Fix in flight in Fix LOGICAL_ERROR "Column identifier is already registered" in UNION ALL #100770 (open, mergeable). Not related to `RefreshTask`.
`Bugfix validation (functional tests)`	FAIL	Framework limitation, not a regression — `04273_concurrent_create_or_replace_refresh_mv` ran clean on master HEAD (`[ OK ] 2.25 sec` on `m8g.4xlarge` ARM at `26251069c810…`), so the bugfix-validation harness reports "Failed to reproduce the bug" since the test was expected to fail on master without the fix. The underlying bug is a non-deterministic shutdown race that requires ASan + many parallel `CREATE OR REPLACE … REFRESH` iterations to trigger reliably (`STID 1805-36fc` originally surfaced in a stress run, not a deterministic single-test execution).
`Mergeable Check` / `PR` / `Report messages`	FAIL	Cascade from the two above; no separate failure.

The new commit only adds a comment, so the bugfix-validation framework limitation will persist on the next CI run too. Happy to retrigger the AST fuzzer once #100770 lands, or to push a Tags: long follow-up if you want to keep the bugfix-validation harness from re-running this test 60 iterations on every retry — let me know your preference.

tuanpach · 2026-06-09T04:47:25Z

@groeneai

Resolve the conflict

groeneai · 2026-06-10T20:03:49Z

Rebased onto current master, conflict resolved. RefreshTask.cpp was refactored upstream (the unlock/notify cycle moved into notifyDependentsIfNeeded, and updateDependenciesIfNeeded became collectDependencyStates/collectDependencyStatesUnlocked), so the view-null guards are ported onto the new structure: bail after each relock site and snapshot the view accessors before unlocking. The branch is now a single clean commit (475eee6).

Verified locally: 04273_concurrent_create_or_replace_refresh_mv passes (20/20 with randomization disabled) and the server stays up across 30+ concurrent CREATE OR REPLACE iterations; the rest of the refreshable-MV suite (02932_1/2/3, 03221, 03327, 03362, 03760, 04201) passes. PR is now MERGEABLE.

groeneai · 2026-06-10T20:04:17Z

Pre-PR validation gate

(Re-validation after rebasing onto current master to resolve the merge conflict; RefreshTask.cpp was refactored upstream, so the fix was ported onto the new helper structure.)

#	Question	Answer
a	Deterministic repro?	The crash is a non-deterministic shutdown race (STID 1805-36fc). It is forced by `04273_concurrent_create_or_replace_refresh_mv.sh`: 6 parallel sessions running `CREATE OR REPLACE MATERIALIZED VIEW ... REFRESH ...` (each REPLACE drops the old MV, triggering `shutdown()`), with `SYSTEM WAIT VIEW` so each shutdown enters the `last_attempt_succeeded` branch where the original SIGSEGV occurred.
b	Root cause explained?	`shutdown()` clears `view` under the mutex, then calls `doScheduling()` synchronously. `doScheduling()` and its helpers (`readZnodesIfNeeded`, `updateCoordinationState`, `collectDependencyStates`, `notifyDependentsIfNeeded`) release the task mutex to talk to keeper / other RefreshSet tasks. A second concurrent `shutdown()` can null `view` during one of those unlocked windows; the original caller then dereferences a stale `view` (`view->getContext()`/`view->getStorageID()`) and crashes (read at offset 0x140, the WithContextImpl weak_ptr).
c	Fix matches root cause?	Yes. Re-check `view` after every relock site and snapshot the view accessors into locals before `lock.unlock()`, so the unlocked region never dereferences a possibly-null `view`. This is the exact pattern `wait()`/`waitForLatestTargetTable()` already document and use. Not a symptom guard: the checks sit precisely at the relock boundaries that the race exploits.
d	Test intent preserved / new tests added?	New regression test `04273_concurrent_create_or_replace_refresh_mv.sh` added (unchanged from the previously-reviewed version, `Tags: long` retained). It exercises the concurrent-shutdown path directly.
e	Both directions demonstrated?	Pre-fix: this code path SIGSEGV'd in CI stress (STID 1805-36fc). Post-fix: 04273 passes 20/20 with randomization disabled and the server stays up across 30+ concurrent CREATE OR REPLACE iterations; the rest of the refreshable-MV suite (02932_1/2/3, 03221, 03327, 03362, 03760, 04201) passes.
f	Fix is general, not a narrow patch?	Audited every `view` deref reachable from `doScheduling` after a mutex release, including master's new helpers `notifyDependentsIfNeeded` and `collectDependencyStatesUnlocked`, and guarded/snapshotted each. The execution path (`executeRefresh` and below) is deliberately untouched and is safe by construction: `shutdown()` calls `execution_task->deactivate()` before clearing `view`, so it never observes a null view.

Session id: cron:clickhouse-worker-slot-18:20260610-184200

groeneai · 2026-06-10T20:44:52Z

@tuanpach Conflict resolved. The branch had drifted far behind master (the "Refreshable MV dependencies overhaul" refactored RefreshTask.cpp heavily), so I rebased onto current master and re-applied the fix on the new structure as a single clean commit (f830d43).

The refactor moved the racy post-unlock view dereference into notifyDependentsIfNeeded(), so the fix is now just two guards:

doScheduling(): bail at entry if view is already null (a parallel shutdown() may have nulled it before the call started).
notifyDependentsIfNeeded(): bail if view is null, and snapshot getContext()/getStorageID() before lock.unlock() so the unlocked notifyDependents() can't dereference a view that a parallel shutdown() nulls in the meantime. shutdown() does its own final notifyDependents(), so skipping it here loses nothing.

Net diff is now just RefreshTask.cpp (+14 lines) plus the regression test. PR is MERGEABLE again.

Validated locally on a debug build: 311 concurrent CREATE OR REPLACE MATERIALIZED VIEW ... REFRESH + DROP statements across 13 parallel sessions (each triggering RefreshTask::shutdown()), 0 crashes, 0 "Unexpected exception in refresh scheduling" in the server log.

Session: cron:clickhouse-worker-slot-19:20260610-191000

groeneai · 2026-06-10T21:47:37Z

Pushed 5d0e14f: restore the executable bit (mode 100755) on 04273_concurrent_create_or_replace_refresh_mv.sh.

The previous single-commit rebase landed the file as mode 100644, so the stateless runner could not exec it and reported return code 126 (Permission denied) instead of running the test (arm_binary parallel; flaky/targeted jobs showed it as UNKNOWN). No content change, only the mode bit.

Verified locally with a master debug build: the test runs in ~3-4s per iteration and prints OK. The only iteration failures under randomization were a local tzdata artifact (session_timezone='zoneinfo/UTC' from get_localzone() on this host), which CIDB confirms does not occur in CI. The 180s flaky-check limit is moot here anyway: the test is tagged long, which exempts it from the per-run time check, and each run is well under the limit regardless.

The "Server died" results on the sanitizer flaky-check jobs are unrelated to this PR: the same artifact appears on ~12 other PRs over the last 3 days (generic TIMEOUT_EXCEEDED in MergeTreeSelect), with no RefreshTask assertion or px != 0 on this commit.

groeneai · 2026-06-10T22:21:57Z

Pre-PR validation gate

Test-only change on 04273_concurrent_create_or_replace_refresh_mv.sh addressing the @clickhouse-gh[bot] review (the loop's || true paths could mask a regression that broke every MV creation).

#	Question	Answer
a	Deterministic repro?	Yes. With the assertion broken (e.g. `REFRESH EVERY 1 HOUR` -> an invalid REFRESH clause), the script's stdout drops from `3\nOK` to just `OK` on every run, so the test fails deterministically against the reference. The healthy path deterministically emits `3\nOK`.
b	Root cause explained?	The test only asserted a final `echo OK`; every `CREATE OR REPLACE` / `SYSTEM WAIT VIEW` in the loop is `2>/dev/null \|\| true` to tolerate the expected concurrent-DDL conflicts. A regression that made all MV creation/refresh fail would leave the server up, reach `echo OK`, and pass -- the test could green without ever exercising the refreshable-MV path.
c	Fix matches root cause?	Yes. Added one non-optional `CREATE MATERIALIZED VIEW ... REFRESH` + `SYSTEM WAIT VIEW` + `SELECT count()` before the replacers start (outside the tolerated-race section). A count mismatch there now fails the test, proving at least one refreshable MV was created and refreshed. The in-loop `\|\| true` conflicts stay tolerated, so the concurrent-shutdown coverage is unchanged.
d	Test intent preserved / new tests added?	Preserved and strengthened. The concurrent CREATE-OR-REPLACE shutdown race is untouched (same NUM_REPLACERS=6 x REPLACES_PER_THREAD=10, same `\|\| true`); the new lines run before it and also give the first round of replacers a live, refreshing MV to contend with.
e	Both directions demonstrated?	Yes. Healthy build: `clickhouse-test` PASS (`[ OK ]`, output `3\nOK`), plus 6/6 raw-script runs producing exactly `3\nOK` and the server staying up across all runs. Broken-MV simulation (invalid REFRESH clause): output becomes `OK` only -> reference mismatch -> test FAILS.
f	Fix is general, not a narrow patch?	N/A -- test robustness change, no source code modified. The underlying `RefreshTask::doScheduling` null-deref fix (this PR's other commit) is unchanged.

Validated locally on a debug build (26.6.1.1, head b7d7581) with a memory-capped server.

Session id: cron:clickhouse-worker-slot-0:20260610-220700

groeneai · 2026-06-11T04:05:58Z

CI finished on b7d7581f1ab4. No PR-caused failures: the regression test runs OK as a regular Stateless test (amd_debug, parallel and arm_binary, parallel both pass). The return code: 126 / Permission denied failures seen earlier were on f830d43 (the single-commit rebase dropped the file's executable bit) and were fixed in 5d0e14f by restoring mode 100755. CIDB confirms 04273 is mode 100755 on HEAD and the parallel runs pass.

Remaining red checks are pre-existing trunk/infra, unrelated to this change:

Bugfix validation (functional tests) / 04273: framework limitation. The shutdown race is non-deterministic and cannot be made to fail on a release master HEAD in a single-test run, so validation reports "Failed to reproduce". Same as on every prior commit of this PR.
Stateless tests (arm_asan_ubsan, azure, sequential): Server died + Segmentation fault (STID 0883-5001) + 02724_show_indexes. The crash stack is the ASan internal allocator (LargeMmapAllocator::Deallocate -> Quarantine::DoRecycle -> ___interceptor_free, no ClickHouse frames), i.e. the 2026-06-10/11 arm_asan_ubsan azure ASan-allocator-OOM outage (CIDB: hundreds of hits across 100+ PRs and master in the same window). Not refresh-related.
Stress test (amd_asan_ubsan) / Stress test (arm_tsan): Hung check / possible deadlock on shutdown, chronic stress families, separately tracked.
AST fuzzer (amd_msan) MSan + Unit tests (asan_ubsan, function_prop_fuzzer): chronic trunk families, separately tracked.

No failure on b7d7581 touches RefreshTask / doScheduling / refreshable MVs other than the bugfix-validation framework gap above. Ready for review.

groeneai · 2026-06-11T05:31:41Z

Removed the 04273 stateless regression test (commit fca30e2) and added the original CI crash report link to the description instead.

Why: Bugfix validation (functional tests) was the only real blocker, and a race reproducer cannot pass it. That check downloads the unfixed master HEAD release binary and runs the changed test once, with no thread fuzzer and no-random-settings, expecting a crash. The RefreshTask::doScheduling null deref is a probabilistic shutdown race (one CIDB hit in 30 days, only under stress + thread fuzzer), and px != 0 is a debug-only assertion compiled out of the release binary. I verified against the same master binary (26.6.1.626) that the test runs clean every time, so the check inverts the pass to "Failed to reproduce the bug" and blocks the PR. There is no failpoint on master to force the interleaving.

The C++ fix is unchanged. Per the bugfix-PR convention, crash fixes that can't be reproduced on the release binary are backed by the CI report link in the description (now added), which also satisfies the new-tests check.

groeneai · 2026-06-11T11:54:00Z

CI on 6be1d876e226 is finished. No PR-caused failures.

The single red test job is Stateless tests (amd_llvm_coverage, old analyzer, s3 storage, DatabaseReplicated, WasmEdge, parallel) (the Mergeable Check / PR reds are just its aggregator cascade). CIDB shows zero test-level failures for that job on this commit, which is the job-level infrastructure failure pattern for the heavy coverage variant (LLVM profile / timeout), not a test regression. This commit is message-only on top of the previous green content and only touches RefreshTask.cpp null guards, so it cannot cause a real coverage-test failure.

Ready for review.

groeneai · 2026-06-11T13:26:54Z

CI finished on HEAD 6be1d87 (the rebased+squashed single commit = the RefreshTask.cpp fix only). Summary of the run and disposition of every non-green signal:

Functional test results on this HEAD: 578 passed, 0 failed, 195 skipped. The C++ fix is holding (no px != 0 assertion, no "Unexpected exception in refresh scheduling" LOGICAL_ERROR on any recent HEAD).
Bugfix validation (functional tests) / 04273 — RESOLVED. The regression test was removed from the diff (it cannot pass bugfix validation: that job runs the changed test once against the unfixed master release binary, where the debug-only px != 0 assertion is compiled out and the probabilistic shutdown race effectively never fires, so a PASS is always inverted to FAIL). Net branch diff is now the RefreshTask.cpp fix only; bugfix validation is correctly skipped (no functional test file changed) and the new-tests check is satisfied via the report link in the PR body.
Stateless tests (amd_llvm_coverage, old analyzer, ...) job marked failed — NOT a test failure. 578 tests passed; the only red is a job-level post-step glitch in "Collect logs": jeprof ... flamegraph.pl reported "No stack counts found" (empty jemalloc heap profile) and the container connection was canceled (grpc: the client connection is closing). This is CI infrastructure noise in log collection, unrelated to the change.

No PR-caused failures. Moving to review.

tuanpach · 2026-06-17T09:14:28Z

@groeneai Check the failure in CI. Any PR fixing it?

shutdown() runs doScheduling(is_shutdown=true) synchronously without holding the task mutex (it deactivates the background scheduling_task first, then calls doScheduling directly). Two concurrent shutdowns of the same refreshable MV can therefore both reach doScheduling, and one can null `view` (shutdown() sets view = nullptr after its final notifyDependents) while the other is still running. The running doScheduling then dereferences a null `view` via view->getContext()/view->getStorageID(), crashing with SIGSEGV (the weak_ptr inside StorageMaterializedView at offset 0x140). After the "Refreshable MV dependencies overhaul" refactor on master, the racy post-unlock view dereference is centralized in notifyDependentsIfNeeded(), which is reached from readZnodesIfNeeded() and updateCoordinationState() after they release the mutex internally. Two guards close the window: * doScheduling(): bail at entry if `view` is already null (a parallel shutdown may have nulled it before this call started). * notifyDependentsIfNeeded(): bail if `view` is null, and snapshot getContext()/getStorageID() before unlocking so the unlocked notifyDependents() call can't dereference a view that a parallel shutdown nulls in the meantime. shutdown() does its own final notifyDependents(), so skipping it here loses nothing. This is the null-view SIGSEGV seen under stress with the thread fuzzer (STID 1805-36fc). It does not address the separate "Unexpected exception in refresh scheduling" LOGICAL_ERROR (ClickHouse#106737, STID 2508-3e7b), which is the catch-all re-raising a thrown inner exception from the dependency path and is tracked separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-17T09:49:07Z

The only red on the previous head (6be1d876) was Stateless tests (amd_llvm_coverage, old analyzer, s3 storage, DatabaseReplicated, WasmEdge, parallel). It was not a test failure and not caused by this PR, so there is no fixing PR to point at:

CIDB records 0 test-level FAIL/ERROR on that commit (578 stateless tests passed, 0 failed). The structured report has 0 results recorded.
job.log shows the job crashed at infra level after "All tests have finished", during the post-test jemalloc heap-profiling step: error during connect: ... write: broken pipe -> Job killed, exit code [125]. The coverage runner's Docker daemon died.
The 3 transient in-run errors (01187_set_profile_as_setting, 03579_create_table_populate_from_s3, 04247_attach_table_storage_clauses_without_engine_error) each passed 0/3 on the reproducibility reruns and are unrelated to refreshable MVs.

That run was 6 days old. I rebased the fix onto current master (no drift in RefreshTask.cpp, diff unchanged at +14/-1) and force-pushed 1149ae22, so CI is running fresh now and the stale coverage-runner red should clear. PR is MERGEABLE.

groeneai · 2026-06-17T14:57:12Z

CI finished on 1149ae22fb404ab9efa24c28e9648724d8fba430 (the fresh-master rebase). No PR-caused failures. The two red stress jobs are pre-existing trunk/infra failures unrelated to this null-deref fix:

Stress test (amd_tsan) (Logical error: 'Unexpected exception in refresh scheduling' STID 2508-3e7b, plus the Cannot start clickhouse-server / Check failed cascade from the same server abort): this is the chronic RefreshTask create()-initial-scheduling re-raise family, hitting 13+ unrelated PRs plus master in the last 30 days. It is a thrown-exception path (UNKNOWN_DATABASE from unqualified REFRESH DEPENDS ON), distinct from this PR's null-view SIGSEGV shutdown race. Tracked separately and fixed by Fix crash on REFRESH ... DEPENDS ON shadowed by a temporary table #107156 (Resolve REFRESH ... DEPENDS ON dependencies in the view's database).
Stress test (arm_msan) (Hung check failed, possible deadlock found): chronic msan stress-harness hang, not in the refresh path this PR touches.

This PR's C++ diff is confined to the doScheduling/notifyDependentsIfNeeded null guards in RefreshTask.cpp; neither stress failure originates there. Ready for review.

tuanpach · 2026-06-22T02:35:26Z

@groeneai

Merge master to this branch

nikitamikhaylov · 2026-06-23T12:34:02Z

@groeneai Hello?

…h-task-shutdown-view-race-105548

groeneai · 2026-06-23T13:21:23Z

@nikitamikhaylov Sorry for the delay. @tuanpach approved on 06-22 and in the same minute asked me to merge master into the branch; I just did that.

Merged origin/master into the branch (no conflicts). New head 6a3d451, MERGEABLE, fresh CI running. The approved commit 1149ae22 stays in history so the approval is preserved.
The C++ change is unchanged and confined to RefreshTask.cpp: a if (!view) return; guard at doScheduling entry plus a guard + snapshot-before-unlock in notifyDependentsIfNeeded. It fixes the null-view SIGSEGV during parallel shutdown() (STID 1805-36fc).
The only reds on the previous head were the Stress test (*) "Unexpected exception in refresh scheduling" family. That is a separate trunk bug: a thrown exception re-raised by the catch-all in the create() initial-scheduling lambda (RefreshTask.cpp:223 -> :936), not the shutdown null-deref this PR fixes (a SIGSEGV never reaches that catch). The UNKNOWN_DATABASE variant of it (STID 2508-3e7b) was already fixed by the merged Fix crash on REFRESH ... DEPENDS ON shadowed by a temporary table #107156; a residual variant is still on master and tracked separately, independent of this PR.

Should be ready to merge once the fresh CI on 6a3d451 settles.

clickhouse-gh · 2026-06-23T16:43:50Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 12/14 (85.71%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

PedroTadim added the can be tested Allows running workflows for external contributors label May 22, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 22, 2026

clickhouse-gh Bot reviewed May 22, 2026

View reviewed changes

Comment thread src/Storages/MaterializedView/RefreshTask.cpp Outdated

tuanpach reviewed May 25, 2026

View reviewed changes

tuanpach self-assigned this May 25, 2026

This was referenced Jun 2, 2026

Support constant haystack with non-constant needle in LIKE/ILIKE/match #100479

Merged

Push ORDER BY from outer query into simple VIEWs for distributed optimization #94102

Merged

Add Rewrite rules #88234

Open

groeneai force-pushed the groeneai/fix-refresh-task-shutdown-view-race-105548 branch from 8816818 to 475eee6 Compare June 10, 2026 20:00

groeneai force-pushed the groeneai/fix-refresh-task-shutdown-view-race-105548 branch from 475eee6 to f830d43 Compare June 10, 2026 20:44

groeneai mentioned this pull request Jun 10, 2026

Logical error: 'Unexpected exception in refresh scheduling' (STID: 2508-3e7b) #106737

Closed

clickhouse-gh Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04273_concurrent_create_or_replace_refresh_mv.sh Outdated

clickhouse-gh Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread src/Storages/MaterializedView/RefreshTask.cpp

groeneai force-pushed the groeneai/fix-refresh-task-shutdown-view-race-105548 branch from fca30e2 to 6be1d87 Compare June 11, 2026 06:34

groeneai mentioned this pull request Jun 17, 2026

Web UI: keep single query unselected on run #105711

Merged

groeneai force-pushed the groeneai/fix-refresh-task-shutdown-view-race-105548 branch from 6be1d87 to 1149ae2 Compare June 17, 2026 09:48

tuanpach approved these changes Jun 22, 2026

View reviewed changes

This was referenced Jun 22, 2026

Allow ON CLUSTER <db_name> pass-through on Replicated / Shared databases for RENAME and EXCHANGE #106192

Open

Reapply "Add silk fibers aware socket implementations" #107680

Merged

Merge remote-tracking branch 'origin/master' into groeneai/fix-refres…

6a3d451

…h-task-shutdown-view-race-105548

alexey-milovidov merged commit 9af0c5d into ClickHouse:master Jun 26, 2026
166 of 167 checks passed

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 26, 2026

This was referenced Jun 27, 2026

Do not abort on block-number counter reset in ReplicatedMergeTreeSink #108566

Open

Make SYSTEM SYNC MERGES wait for part_log to be written #108584

Open

alexey-milovidov mentioned this pull request Jun 29, 2026

Add server setting additional_memory_tracking_per_thread #104965

Draft

32 tasks

groeneai mentioned this pull request Jun 29, 2026

Add optimize_row_order_if_no_order_by (reopen #103919) #104591

Open

1 task

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented May 21, 2026 • edited by robot-ch-test-poll2 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Version info

Uh oh!

groeneai commented May 21, 2026

Pre-PR validation gate

Uh oh!

clickhouse-gh Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

groeneai commented May 22, 2026

Uh oh!

tuanpach May 25, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai May 25, 2026

Choose a reason for hiding this comment

Uh oh!

tuanpach May 26, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai May 26, 2026

Choose a reason for hiding this comment

Uh oh!

tuanpach commented May 25, 2026

Uh oh!

groeneai commented May 25, 2026

Uh oh!

tuanpach commented May 26, 2026

Uh oh!

groeneai commented May 26, 2026

Uh oh!

tuanpach commented Jun 9, 2026

Uh oh!

groeneai commented Jun 10, 2026

Uh oh!

groeneai commented Jun 10, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 10, 2026

Uh oh!

Uh oh!

groeneai commented Jun 10, 2026

Uh oh!

groeneai commented Jun 10, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

groeneai commented Jun 11, 2026

Uh oh!

Uh oh!

groeneai commented Jun 11, 2026

Uh oh!

groeneai commented Jun 11, 2026

Uh oh!

tuanpach commented Jun 17, 2026

Uh oh!

groeneai commented Jun 17, 2026

Uh oh!

groeneai commented Jun 17, 2026

Uh oh!

tuanpach commented Jun 22, 2026

Uh oh!

nikitamikhaylov commented Jun 23, 2026

Uh oh!

groeneai commented Jun 23, 2026

Uh oh!

groeneai commented May 21, 2026 •

edited by robot-ch-test-poll2

Loading

clickhouse-gh Bot commented May 22, 2026 •

edited

Loading

groeneai commented Jun 11, 2026 •

edited

Loading