iframe-proxy

groeneai · 2026-06-25T05:46:54Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a server crash (LOGICAL_ERROR: 'Unexpected exception in refresh scheduling') that could happen on startup when a coordinated refreshable materialized view in a Replicated database was attached on a Keeper that does not support the MULTI_READ feature flag (for example after a Keeper downgrade). The view is now stopped gracefully instead of aborting the server.

Description

Closes: #106737 (STID 2508-3e7b)

A coordinated refreshable materialized view used to abort the server when re-attached on a Keeper without MULTI_READ. The constructor's up-front MULTI_READ / CREATE_IF_NOT_EXISTS check was gated behind the fresh-CREATE path (!attach && !is_restore_from_backup) inside the !replica_path_existed branch, so on ATTACH / restore / existing-replica the view stayed coordinated = true. The scheduling thread then threw NOT_IMPLEMENTED in readZnodesIfNeeded (RefreshTask.cpp:1378), and the doScheduling catch-all re-raised it as a LOGICAL_ERROR (RefreshTask.cpp:936), which aborts the server in debug/sanitizer builds and crash-loops on restart.

Inner exception and stack from the reported run (amd_msan):

<Error> RefreshTask(...): Unexpected exception in refresh scheduling. The view will be stopped.: Code: 48. DB::Exception: Keeper server doesn't support multi-reads. Refreshable materialized views won't work. (NOT_IMPLEMENTED)
  5. src/Storages/MaterializedView/RefreshTask.cpp:1378  DB::RefreshTask::readZnodesIfNeeded(...)
  6. src/Storages/MaterializedView/RefreshTask.cpp:698   DB::RefreshTask::doScheduling(bool)
<Fatal> : Logical error: 'Unexpected exception in refresh scheduling'.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103786&sha=04c29e862d95e40e250aff7eb504de5d1af13d33&name_0=PR&name_1=Stress%20test%20%28amd_msan%29

Fix

The feature-flag check is moved before the znode-create block, so it runs on every coordinated path:

Fresh CREATE still rejects with NOT_IMPLEMENTED (unchanged behavior).
ATTACH / restore / existing-replica now disable coordination, request stop, and record the reason. readZnodesIfNeeded returns early (!coordinated), so the scheduling thread never reaches the throw, and the view ends up Disabled instead of crashing the server.

This is the approach prescribed by the maintainer in the issue (detect the missing MULTI_READ up front; fresh CREATE keeps rejecting, ATTACH/restore must not throw).

Test

Adds an integration test test_refreshable_mv_no_multi_read: it creates a coordinated refreshable MV in a Replicated database with MULTI_READ enabled, then restarts the server on a Keeper with MULTI_READ disabled. Without the fix the server crash-loops with the LOGICAL_ERROR above; with the fix the server stays up and the view reports Disabled with the reason in system.view_refreshes.exception.

Version info

Merged into: 26.7.1.256

groeneai · 2026-06-25T05:47:16Z

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. Integration test `test_refreshable_mv_no_multi_read`: create a coordinated refreshable MV in a `Replicated` DB with `MULTI_READ` on, restart on a Keeper with `MULTI_READ` off. On the unfixed binary the server crash-loops every time with `Logical error: 'Unexpected exception in refresh scheduling'` (RefreshTask.cpp:1378 -> 698 -> 936).
b	Root cause explained?	The constructor's `MULTI_READ`/`CREATE_IF_NOT_EXISTS` check was gated behind the fresh-CREATE path (`!attach && !is_restore_from_backup`) inside `!replica_path_existed`. On ATTACH/restore/existing-replica it is skipped, so the view stays `coordinated=true`; the scheduling thread then throws `NOT_IMPLEMENTED` in `readZnodesIfNeeded`, which the `doScheduling` catch-all re-raises as `LOGICAL_ERROR` -> server abort.
c	Fix matches root cause?	Yes. The check is moved before the znode-create block so it runs on every coordinated path. Fresh CREATE still throws; ATTACH/restore disable coordination + request stop, so `readZnodesIfNeeded` returns early (`!coordinated`) and the throw site is never reached.
d	Test intent preserved / new tests added?	New integration test added (none weakened). It asserts the server stays up, the view is `Disabled`, the reason is reported in `system.view_refreshes.exception`, and no `Unexpected exception in refresh scheduling` is logged.
e	Both directions demonstrated?	Yes. Unfixed binary (Build ID 26ca63af...): server crash-loops, test FAILS. Fixed binary (Build ID 2b06a3dd... after rebase onto master): test PASSES. Both on Build-ID-verified binaries.
f	Fix is general across code paths?	Yes. The single up-front check now covers all three coordinated entry paths (fresh CREATE, ATTACH, restore/existing-replica). Fresh CREATE preserves the existing reject behavior; the two non-throwing paths share one graceful-stop branch. Verified the early-return interacts correctly with `readZnodesIfNeeded` (`!coordinated` guard), `doScheduling` (`stop_requested` -> `Disabled`), `finalizeRestoreFromBackup`/`start`/`startup`.
g	Fix generalizes across inputs?	Yes. The feature-flag predicate is independent of MV data types / column wrappers; it depends only on the Keeper feature flags and the create-mode flags (`attach`, `is_restore_from_backup`). Both missing flags (`MULTI_READ`, `CREATE_IF_NOT_EXISTS`) are handled by the same branch.
h	Backward compatible?	Yes. No setting, on-disk, wire, or metadata format change. Fresh CREATE behavior is unchanged. The only behavior change is on the previously-crashing ATTACH/restore path, which now stops the view gracefully instead of aborting the server.
i	Invariants and contracts preserved?	Yes. The early return runs inside the `if (coordinated)` block before any znode is created, leaving `coordination.coordinated=false` consistent with the uncoordinated branch. `readZnodesIfNeeded`'s `!coordinated` precondition for early-return holds; `doScheduling`'s `stop_requested -> Disabled` contract is upheld; no Keeper access happens on the stopped path.

Session id: cron:clickhouse-worker-slot-1:20260625-040002

groeneai · 2026-06-25T05:48:09Z

cc @al13n321 @tuanpach could you review? Moves the up-front Keeper feature-flag check in the RefreshTask constructor to cover the ATTACH/restore/existing-replica paths, so a coordinated refreshable MV on a Keeper without MULTI_READ stops gracefully (Disabled) instead of throwing on the scheduling thread and aborting the server (#106737).

clickhouse-gh · 2026-06-25T06:33:06Z

Workflow [PR], commit [5aa9c52]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR moves the Keeper MULTI_READ / CREATE_IF_NOT_EXISTS validation for coordinated refreshable materialized views so ATTACH, restore, and existing-replica startup paths record a non-resumable unavailable state and stay Disabled instead of surfacing the scheduler LOGICAL_ERROR. The current head keeps coordination.coordinated true, blocks local resume paths, and adds focused integration coverage for the downgrade/restart path and SYSTEM START VIEW; the two prior bot threads are addressed in the current code. I did not find unresolved code-level findings.

PR Metadata

💡 The Bug Fix category is appropriate and the changelog entry is specific. Optional wording adjustment: ClickHouse text normally describes LOGICAL_ERROR cases as exceptions rather than server-level failures. Suggested changelog entry:

Fixed a LOGICAL_ERROR: 'Unexpected exception in refresh scheduling' exception that could happen on startup when a coordinated refreshable materialized view in a Replicated database was attached on a Keeper that does not support the MULTI_READ feature flag, for example after a Keeper downgrade. The view is now stopped gracefully instead of surfacing the scheduler exception.

Final Verdict

Status: ✅ Approve

No required code changes.

…/restore A coordinated refreshable materialized view (in a Replicated database) crashed the server when re-attached on a Keeper that lacks the MULTI_READ feature flag, e.g. after a Keeper downgrade. The constructor's up-front MULTI_READ / CREATE_IF_NOT_EXISTS check was gated behind the fresh-CREATE path (!attach && !is_restore_from_backup) inside the !replica_path_existed branch, so ATTACH / restore / existing-replica kept coordination enabled. The scheduling thread then threw NOT_IMPLEMENTED in readZnodesIfNeeded, which the doScheduling catch-all re-raised as a LOGICAL_ERROR, aborting the server (and crash-looping on restart). Move the feature-flag check before the znode-create block so it runs on every coordinated path. Fresh CREATE still rejects with NOT_IMPLEMENTED. On ATTACH / restore / existing-replica the view now enters a permanent, non-resumable "coordination unavailable" state: a new coordination.unavailable flag is set, the task is stopped, and the reason is recorded. doScheduling short-circuits to Disabled before touching Keeper, so readZnodesIfNeeded is never reached and the server stays up. coordination.coordinated is left true on purpose. Clearing it would turn an originally coordinated view into an uncoordinated one, and a later SYSTEM START VIEW or finalizeRestoreFromBackup would then resume it as a local refresh, which breaks the replicated target table of a non-APPEND view in a Replicated database. start() and finalizeRestoreFromBackup() now also check the unavailable flag and refuse to resume, so the view stays Disabled until the table is re-created on a Keeper that supports the required feature flags. Adds an integration test that creates a coordinated refreshable MV with MULTI_READ enabled, restarts on a Keeper with MULTI_READ disabled and asserts the server stays up with the view Disabled, and then asserts SYSTEM START VIEW does not resume an uncoordinated local refresh. Closes: ClickHouse#106737 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-25T07:44:18Z

Updated after the review feedback (the fix now uses a non-resumable coordination.unavailable state instead of clearing coordination.coordinated).

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. Integration test `tests/integration/test_refreshable_mv_no_multi_read/`: coordinated RMV in a Replicated DB, restart on a Keeper with `multi_read=0`. Unfixed binary crash-loops with the exact #106737 signature (RefreshTask.cpp readZnodesIfNeeded NOT_IMPLEMENTED -> doScheduling -> catch-all LOGICAL_ERROR -> abort).
b	Root cause explained?	The ctor MULTI_READ/CREATE_IF_NOT_EXISTS check was gated to fresh CREATE only, so on ATTACH/restore/existing-replica a coordinated view kept `coordinated=true` on a Keeper without MULTI_READ. The scheduling thread then threw NOT_IMPLEMENTED in `readZnodesIfNeeded`, re-raised by the `doScheduling` catch-all as a LOGICAL_ERROR -> server abort. The review also found that the first fix (clearing `coordinated`) demoted the view to uncoordinated, so `SYSTEM START VIEW`/`finalizeRestoreFromBackup` could resume it as a local refresh and corrupt the replicated target.
c	Fix matches root cause?	Yes. The feature-flag check runs on every coordinated path. Fresh CREATE still throws. ATTACH/restore enters a permanent non-resumable `coordination.unavailable` state; `doScheduling` short-circuits to `Disabled` before any Keeper access, so the throw is never reached. `coordinated` stays true, so the view is never demoted to a local refresh; `start()`/`finalizeRestoreFromBackup()` refuse to resume.
d	Test intent preserved / new tests added?	New integration test. Asserts (1) server stays up + view `Disabled` + exception mentions multi-read + no `Unexpected exception in refresh scheduling`, and (2) `SYSTEM START VIEW` after the graceful stop keeps it `Disabled` (does not resume an uncoordinated local refresh).
e	Both directions demonstrated?	Yes. Fixed binary: test PASSES. Original master (Build ID 26ca63af): crash-loops with the #106737 signature (test FAILS). Previous `coordinated=false` approach: test FAILS on the new step (`SYSTEM START VIEW` resumes to `Scheduled`).
f	Fix is general across code paths?	Yes. The single `unavailable` flag is honored by all resume/scheduling entry points: `doScheduling`, `start()` (SYSTEM START VIEW), `finalizeRestoreFromBackup()` (restore auto-resume). `drop()` keeps working since `coordinated` stays true and it uses only single-key/multi-write Keeper ops (no MULTI_READ).
g	Fix generalizes across inputs?	The trigger is a Keeper capability (missing MULTI_READ or CREATE_IF_NOT_EXISTS), not a datatype. Both missing flags take the same path; both attach and restore-from-backup are covered; fresh CREATE keeps rejecting. No type-wrapper dimension applies.
h	Backward compatible?	Yes. No setting default, on-disk, wire, or metadata format change. Fresh CREATE behavior is unchanged (still rejects). The only behavior change is on the previously-crashing ATTACH/restore path, which now stays up Disabled.
i	Invariants and contracts preserved?	Yes. Key invariant: a coordinated (non-APPEND, Replicated-DB) RMV must never run an uncoordinated local refresh, since `StorageMaterializedView` requires coordination for the replicated target. Leaving `coordinated=true` + making the state non-resumable upholds it on all paths (scheduling, SYSTEM START VIEW, restore finalize). `doScheduling` holds `mutex` across the early `Disabled` return; no Keeper access on the unavailable path.

Session id: cron:clickhouse-worker-slot-2:20260625-071500

…ti-read test SYSTEM REFRESH VIEW is asynchronous: RefreshTask::run sets the state to Scheduling synchronously and the background scheduler only later reaches the coordination.unavailable branch and switches it back to Disabled. Reading system.view_refreshes right after the refresh request could therefore observe the transient Scheduling state and fail the assertion spuriously. Add SYSTEM WAIT VIEW after the refresh request (and after the downgrade restart, which schedules the same graceful-stop pass asynchronously) so the status read always waits for the scheduled pass. WAIT VIEW returns immediately once the view is Disabled, so it does not block on the coordination-unavailable view. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-25T13:26:49Z

CI finish ledger — `cdabef4`

Every failure below has an owner. The single failure is unrelated to this PR's RefreshTask/Keeper diff.

Check / test	Reason	Owner
Stress test (arm_release) / S3_ERROR "No such key thrown" (CacheMetadata background download)	chronic trunk stress (filesystem-cache background-download vs concurrent object removal; ~8 unrelated PRs + master over 30d)	a fix task is created (investigating at full effort — fixing PR link to follow here)

This PR changes only RefreshTask.cpp/.h and a refreshable-MV integration test, which cannot touch the S3 object-storage / filesystem-cache path.

Session id: cron:our-pr-ci-monitor:20260625-130000

groeneai · 2026-06-25T13:44:51Z

Fixing PR for the Stress test S3_ERROR No such key thrown (CacheMetadata background download) failure created: #108489

alexey-milovidov

This looks reasonable.

However, the case when it happens is very narrow - likely it happens only in tests.

Still good to merge.

alexey-milovidov · 2026-06-29T03:43:05Z

@groeneai, provide a fix for S3_ERROR No such key thrown (see clickhouse-server.log or no_such_key_errors.txt), and if it is already fixed - update the branch here.

…htask-multiread-attach-106737

groeneai · 2026-06-29T03:55:42Z

The S3_ERROR No such key thrown in the Stress test is the chronic filesystem-cache background-download race: CacheMetadata prefetches an object that is concurrently removed during stress, logging a benign "Error during background download ... No such key". It is not caused by this PR, which only touches RefreshTask and Keeper (no S3 or filesystem-cache code).

It is already fixed by #108489 (merged 2026-06-28), which adds that benign line to the stress harness allowlist in tests/docker_scripts/stress_tests.lib. I have updated this branch onto current master (head now 5aa9c52) so the Stress test picks up the allowlist on re-run.

clickhouse-gh · 2026-06-29T07:28:14Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	85.40%	85.40%	+0.00%
Functions	92.60%	92.60%	+0.00%
Branches	77.70%	77.60%	-0.10%

Changed lines: Changed C/C++ lines covered by tests: 30/32 (93.75%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

groeneai · 2026-06-29T08:30:37Z

CI finish ledger — `5aa9c52`

After merging master to pick up the merged stress-harness fix #108489, CI fully finished green on this head. The only red check is the external private mirror sync.

The previously-tracked Stress test (arm_release) / S3_ERROR No such key line is resolved: fixing PR #108489 (ours, merged 2026-06-28) is now on this branch via the master merge.

Session id: cron:our-pr-ci-monitor:20260629-080000

groeneai mentioned this pull request Jun 25, 2026

Logical error: 'Unexpected exception in refresh scheduling' (STID: 2508-3e7b) #106737

Closed

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 25, 2026

murphy-4o added the can be tested Allows running workflows for external contributors label Jun 25, 2026

murphy-4o assigned murphy-4o and unassigned murphy-4o Jun 25, 2026

clickhouse-gh Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread src/Storages/MaterializedView/RefreshTask.cpp Outdated

murphy-4o self-assigned this Jun 25, 2026

groeneai force-pushed the groeneai/fix-refreshtask-multiread-attach-106737 branch from 98c1022 to 67a1d74 Compare June 25, 2026 07:42

clickhouse-gh Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread tests/integration/test_refreshable_mv_no_multi_read/test.py

alexey-milovidov approved these changes Jun 28, 2026

View reviewed changes

alexey-milovidov self-assigned this Jun 28, 2026

groeneai mentioned this pull request Jun 29, 2026

Support CAST from QBit to Array #108072

Merged

1 task

Merge remote-tracking branch 'origin/master' into groeneai/fix-refres…

5aa9c52

…htask-multiread-attach-106737

alexey-milovidov mentioned this pull request Jun 29, 2026

Add pcodec (PCO) compression codec #106222

Open

1 task

alexey-milovidov merged commit 931d5fd into ClickHouse:master Jun 29, 2026
173 of 174 checks passed

robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 29, 2026

groeneai mentioned this pull request Jun 29, 2026

Add optimize_row_order_if_no_order_by (reopen #103919) #104591

Open

1 task

alexey-milovidov mentioned this pull request Jun 30, 2026

MaterializedPostgreSQL: don't break all replication on one table's structure mismatch #107427

Merged

1 task

groeneai mentioned this pull request Jun 30, 2026

Parallel read in order with multiple parts #100394

Open

alexey-milovidov mentioned this pull request Jun 30, 2026

Fix **/ (globstar) to match zero or more directory components #97676

Merged

This was referenced Jun 30, 2026

Fix flaky test_refreshable_mv_attach_without_multi_read #108889

Closed

Stop refreshable MV gracefully when the Keeper feature-flag check races on attach #108890

Open

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented Jun 25, 2026 • edited by robot-ch-test-poll4 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Fix

Test

Version info

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

clickhouse-gh Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

PR Metadata

Final Verdict

Uh oh!

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

Uh oh!

groeneai commented Jun 25, 2026

CI finish ledger — cdabef4

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

alexey-milovidov left a comment

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov commented Jun 29, 2026

Uh oh!

groeneai commented Jun 29, 2026

Uh oh!

clickhouse-gh Bot commented Jun 29, 2026

LLVM Coverage Report

Uh oh!

groeneai commented Jun 29, 2026

CI finish ledger — 5aa9c52

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

groeneai commented Jun 25, 2026 •

edited by robot-ch-test-poll4

Loading

clickhouse-gh Bot commented Jun 25, 2026 •

edited

Loading

CI finish ledger — `cdabef4`

CI finish ledger — `5aa9c52`