Fix CREATE OR REPLACE MATERIALIZED VIEW ... POPULATE losing the source subscription#108728
Fix CREATE OR REPLACE MATERIALIZED VIEW ... POPULATE losing the source subscription#108728alexey-milovidov wants to merge 9 commits into
Conversation
…e subscription `CREATE OR REPLACE MATERIALIZED VIEW ... POPULATE` left the new view unsubscribed from its source table, so every row inserted after the replace was silently dropped, including rows inserted concurrently with the replace. The replace creates a temporary view, runs the `POPULATE` `INSERT` into it, then atomically swaps it with the target via `EXCHANGE`. The `POPULATE` `INSERT` resolves the temporary table by name and caches its name -> UUID mapping in the per-query storage cache (`Context::getOrCacheStorage`). After the `EXCHANGE` that UUID is the new, live view, but the temporary name now refers to the old table that is about to be dropped. The internal `DROP` resolved the temporary name through the stale cache and shut down the live view instead, detaching it from its source. Fix `getOrCacheStorage` to treat a cached entry as stale when the resolved storage no longer has the requested name (it was renamed or reassigned within the same query) and re-resolve it by name. Closes: #108726 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses review feedback on #108728: the existing test only inserted into the source after `CREATE OR REPLACE MATERIALIZED VIEW ... POPULATE` had completed, so it covered the deterministic detached-view symptom but not the concurrent scenario the issue describes (rows inserted while the replace is in progress). This adds `04490_..._concurrent.sh`, which inserts into the source concurrently with the replace (the snapshot window is widened via `merge_tree_storage_snapshot_sleep_ms` so the inserts reliably overlap it) and asserts that no source row is lost - the set of distinct ids in the view equals the set in the source. Verified non-flaky across many local runs against a build that carries the fix. The test asserts only that no row is lost (the contract this change restores). It deliberately does not assert exactly-once delivery: plain `POPULATE` is not atomic, so a concurrently inserted row can be both captured by the snapshot and delivered through the subscription, i.e. duplicated. Making `POPULATE` exactly-once is a separate change. The existing deterministic test is renumbered from `04415` to `04489` because the merge with master introduced colliding `04415_*` tests.
|
Pushed
On the review point about the dependency-graph window during the rename/ CI: the only red is |
The per-query storage cache (`Context::getOrCacheStorage`) is keyed by qualified name only (`StorageCache::Shard::set` uses `StorageID::DatabaseAndTableNameEqual`), so a cache hit can carry a UUID that differs from the `StorageID` being resolved. After a same-name replacement the name can map to UUID B while a caller still asks for UUID A; the freshness check only compared names, so it would hand out UUID B instead of letting the fresh lookup report `UNKNOWN_TABLE`/`TABLE_UUID_MISMATCH`. Require UUID equality on the cache-hit path when the requested `StorageID` has a UUID. This only makes the cache more conservative (more re-resolutions through the authoritative lookup); it never returns a different table than requested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ent no-loss The concurrent regression test asserted that no source row is lost when inserts race the replace (`uniqExact(mv) = uniqExact(src)`). That contract is not what this PR provides: the source-to-view dependency transfer inside the internal `EXCHANGE` (`InterpreterRenameQuery`) is not yet atomic, so a row inserted in the brief instant the replace swaps the tables can still be missed. Under the sanitizer and loaded CI runs that window opened reliably, so the test failed across Fast test and every flaky check. Rewrite it to prove only the deterministic fix this PR makes: once the (concurrent) replace has completed, the new view is still subscribed to its source, so a sentinel row inserted after everything settles must reach the view. This is deterministic and independent of the residual race. Also enable `set -e`, keep the background-insert PID and `wait` on it explicitly, so a failed insert fails the test (previously the bare `wait` could not). Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108728&sha=c86d0bc3a50d1575af7560811b4db5532cbca979&name_0=PR&name_1=Fast%20test%20%28arm_darwin%29 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the AI review's "Needs changes" verdict and the CI failure.
Merged |
The flaky check failed `04490_..._concurrent` because the last revision enabled
`set -e` and `wait`-ed on the background inserts "so a failed insert fails the
test". But a concurrent `INSERT INTO src` that lands in the brief instant the
replace swaps the tables can legitimately fail with `UNKNOWN_TABLE` ("Target
table ... of view ... doesn't exist"): the source-to-view dependency transfer
inside the internal `EXCHANGE` is not yet atomic, so the new view's target table
is momentarily unreachable. That is precisely the residual window this PR
documents as out of scope, so failing the test on it contradicts the test's own
stated contract (assert subscription liveness, not concurrent delivery).
Make the concurrent inserts tolerate that specific `UNKNOWN_TABLE` error while
still failing the test on any other insert error. The assertion is unchanged: a
sentinel row inserted after everything settles must reach the view.
Verified against a local build carrying the fix: the rewritten test passes 50/50
runs, and it still fails (sentinel count 0) against a build without the fix, so
it continues to catch the regression. The error-handling itself was unit-checked
to tolerate `UNKNOWN_TABLE` and re-raise any other error.
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108728&sha=84561e343cc5b95655a0d81ef488845931b9d82b&name_0=PR&name_1=Stateless%20tests%20%28amd_debug%2C%20flaky%20check%29
PR: #108728
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Fixed the flaky check on this PR's own test (
|
`master` now carries `04489_max_threads_auto_parsing_compat` and two `04490_*` tests, which collided with the deterministic and concurrent regression tests added here. Renumber them to the next free prefixes `04492`/`04493` and update the cross-reference comment in the concurrent test. Test-only change; the test bodies are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Cleared the AI-Review "Request changes" verdictIts sole minimum-required action was the issue relationship. I changed This is the conservative direction: under-claiming what the PR fixes is safer than Renumbered the regression tests off a fresh
|
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 7/7 (100.00%) · Uncovered code |

Related: #108726
Related: #108715
CREATE OR REPLACE MATERIALIZED VIEW ... POPULATEleft the new view unsubscribed from its source table, so every row inserted after the replace was silently dropped.The replace creates a temporary view, runs the
POPULATEINSERTinto it, then atomically swaps it with the target viaEXCHANGE. ThePOPULATEINSERTresolves the temporary table by name and caches its name → UUID mapping in the per-query storage cache (Context::getOrCacheStorage). After theEXCHANGEthat UUID is the new, live view, but the temporary name now refers to the old table that is about to be dropped. The internalDROPthen resolved the temporary name through the stale cache and shut down the live view instead, detaching it from its source.The fix makes
getOrCacheStoragetreat a cached entry as stale when the resolved storage no longer has the requested name (it was renamed or reassigned within the same query, e.g. by the internalEXCHANGEofCREATE OR REPLACE), or when the caller asked for a specific UUID that the name-keyed cache no longer matches. In those cases the entry is dropped and the table is re-resolved by name. This is a natural extension of the existing staleness check that already re-resolves when the cached UUID no longer exists.With the fix the new view stays subscribed to its source after the replace, so inserts that follow the replace reach the view again.
Note on scope: this fixes the deterministic detach. A separate, much narrower window remains — the source-to-view dependency transfer inside the
EXCHANGE(InterpreterRenameQuery) is not yet atomic, so a row inserted concurrently, in the brief instant the replace is swapping the tables, can still be missed (and an insert that lands exactly mid-swap can even fail withUNKNOWN_TABLE). Making that transfer atomic (andPOPULATEexactly-once) is left for a follow-up. Because #108726 asks for full concurrent exactly-once delivery, which is only partially addressed here, this PR is markedRelated:rather thanCloses:and #108726 stays open to track that residual concurrent contract (see also #108715).Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed
CREATE OR REPLACE MATERIALIZED VIEW ... POPULATEleaving the new view unsubscribed from its source table, which silently dropped every row inserted after the replace.Documentation entry for user-facing changes