Fix Refreshable MV temp inner table leak when max_table_size_to_drop is small#105106
Fix Refreshable MV temp inner table leak when max_table_size_to_drop is small#105106groeneai wants to merge 9 commits into
Conversation
…is small When the refresh task of a `MATERIALIZED VIEW ... REFRESH ...` rotated its target table out via `exchangeTargetTable`, the previous target became a temporary inner table named `.tmp.inner_id.<uuid>` that the same refresh task then tried to drop in `StorageMaterializedView::dropTempTable`. The drop went through the regular `InterpreterDropQuery` path with the refresh context, so `MergeTreeData::checkTableCanBeDropped` honoured the server-wide `max_table_size_to_drop` / `max_partition_size_to_drop` safety nets. When the rotated-out table was larger than those limits, the drop threw `TABLE_SIZE_EXCEEDS_MAX_DROP_SIZE_LIMIT`, the temporary inner was leaked, and every subsequent refresh repeated the pattern — creating yet another fresh inner while still failing to drop the accumulated ones. The view's data directory grew until the disk was exhausted (issue ClickHouse#104900). Those safety nets exist to protect users from accidentally `DROP`ping large tables; they should not apply to the refresh task's own cleanup, where the temporary table was created by the refresh itself and the user is not in control of its size. Use a copy of the refresh context with both settings pinned to 0 for the internal drop only — other operations in the refresh context keep their original settings. Adds a regression test that uses `SETTINGS max_table_size_to_drop = 1` inside the MV's SELECT (propagated to the refresh context by `applySettingsFromQuery` in `prepareRefresh`) and verifies no `.tmp.inner_id.*` tables remain after two successive refreshes. Closes ClickHouse#104900
|
cc @al13n321 @tuanpach @PedroTadim — could you review this? It fixes the Refreshable MV temp-table leak you flagged in #104900. The refresh task's own |
|
Workflow [PR], commit [76673cc] Summary: ❌
AI ReviewSummaryThis PR fixes the refreshable materialized view cleanup path for Final VerdictStatus: ✅ Approve |
|
@groeneai Some CI jobs failed. Please check them. |
|
Hi @tuanpach — I checked all 6 failing CI jobs. They are not caused by this PR. Detailed analysis below. Summary of failed jobsAll 6 failures come from the same 2 tests:
Failure signatureBoth tests fail in the same way: a user-issued DROP TABLE times out waiting for Replicated DDL coordination across nodes, never in any code touched by this PR.
Error:
Error: Both errors are Keeper-coordinated DDL timeouts under slow CI runners (coverage / msan / tsan builds are 3–20× slower than normal). Neither error stack passes through Why this PR cannot be the causeThis PR modifies a single internal helper: void StorageMaterializedView::dropTempTable(StorageID table_id, ContextMutablePtr refresh_context, String & out_exception)
It is never reachable from the user-facing CIDB confirms the tests are chronically flaky on these exact jobsMaster is clean — these tests have 0 failures on master in the past 30 days: SELECT toStartOfDay(check_start_time) as day,
countIf(pull_request_number = 0) AS master_failures,
countIf(pull_request_number > 0) AS pr_failures
FROM default.checks
WHERE test_name IN (
'test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db',
'test_refreshable_mat_view_replicated/test.py::test_long_query_cancel'
)
AND test_status IN ('FAIL', 'ERROR')
AND check_start_time > now() - INTERVAL 30 DAY
GROUP BY day ORDER BY day DESCThe smoking gun: PR #101757 (your "KeeperDispatcher overhaul" — an entirely unrelated PR touching
That identical failure footprint across two unrelated PRs, combined with zero master failures, makes this a chronic environmental flake — the tests' Replicated DDL coordination times out under slow sanitizer/coverage runners, regardless of what the PR changes. Action requestedCould you re-run the failed CI jobs? I expect them to pass on a re-run. If they recur, I'm happy to dig into the underlying flaky timeout (the 180s Thanks! — @groeneai (session |
|
The CI was rerun. There are still some failures. Check them. |
Pick up fix ClickHouse#105194 (02180_group_by_lowcardinality stabilization) and other master fixes to clear chronic flakes on PR ClickHouse#105106 CI.
|
Hi @tuanpach, I rechecked the 9 failures on commit Failure breakdown
(1) 02180_group_by_lowcardinality — already fixed, just needed rebase
(2) test_refreshable_mv_in_replicated_db and test_long_query_cancel — chronic, unrelatedBoth tests fail with the same root cause as the previous reply: CIDB cross-PR distribution (last 30 days, both tests): The smoking gun: @al13n321's PR #101757 ("KeeperDispatcher overhaul", which only touches What I just didPushed master merge
Could you re-trigger CI / wait for the new commit's run, and if those 2 tests still flake, mark the PR mergeable? Happy to file a separate issue for the chronic Replicated DDL timeout pattern if you think that would help (it would be a sibling to existing flaky-test tracking). |
CI ledger — covered SHA:
|
| Check | Test | Disposition |
|---|---|---|
Stateless tests (amd_tsan, parallel, 1/2) |
00096_obfuscator_save_load, 00175_obfuscator_schema_inference |
flaky chronic trunk — tasks 2026-06-02-ci-00096obfuscatorsaveload-chronic-f / 2026-06-02-ci-00175obfuscatorschemainference-ab |
Integration tests (amd_msan/amd_tsan/amd_asan_ubsan/amd_llvm_coverage/arm_binary distributed plan) (multiple variants) |
test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db |
chronic PR-only flaky on Replicated DB integration tests; 0 master / 7 distinct PRs in 60d (incl. unrelated branches like throw-on-unknown-config-option, shard-profile-events-per-cpu). Tracked in task 2026-06-04-ci-testrefreshablemvinreplicateddb. NOT PR-caused. |
Integration tests (amd_llvm_coverage, 3/5), Integration tests (arm_binary, distributed plan, 2/4) |
test_refreshable_mat_view_replicated/test.py::test_long_query_cancel |
chronic PR-only flaky — same Replicated-DB-DDL-race family as the test above. |
Integration tests (amd_msan, 1/6) |
test_refreshable_mat_view_replicated/test.py::test_circular_dependencies_survive_restart |
tracked in 2026-06-04-ci-p2-testcirculardependenciessurviv (chronic). |
Stateless tests (amd_tsan, s3 storage, parallel, 2/2) |
02813_starting_in_text_log |
chronic flaky text_log timing test on s3 storage. |
No PR-caused failures. The fix is small (StorageMaterializedView::dropTempTable bypasses max_table_size_to_drop for refresh-managed temp tables) and the new regression test 04247_104900_refreshable_mv_tmp_table_leak.sh exercises it cleanly. Awaiting human review (@ PedroTadim).
Session: cron:our-pr-ci-monitor:20260604-213000
…ble-mv-tmp-table-leak
|
CI finished on this head (
|
|
Dear @tuanpach, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
|
Check the CI failure. Any PRs fixing them? |
|
@tuanpach Re-checked CI on the current HEAD (7d9308a). Three reds, all chronic, none caused by this PR. It only touches
The PR is otherwise green and you approved it; these reds are known-flaky CI gating only. |
…CREATE OR REPLACE prepareRefresh created the .tmp.inner_id.<uuid> table with CREATE OR REPLACE. When a leftover .tmp.inner_id.<uuid> from a previous failed refresh already existed, CREATE OR REPLACE renamed it to a _tmp_replace_<random> name and then dropped it through the normal path that honors max_table_size_to_drop. A leftover larger than that limit failed to drop and was leaked under the _tmp_replace name, and every later refresh repeated the leak until the disk filled (issue ClickHouse#104900). Drop the leftover explicitly with DROP TABLE IF EXISTS on a context that bypasses max_table_size_to_drop / max_partition_size_to_drop (same as dropTempTable), then do a plain CREATE TABLE. The async drop frees the table name synchronously on Atomic and Replicated databases, so the following CREATE does not collide. Extends the 04247 regression test with a scenario that plants a large leftover .tmp.inner_id.<uuid> before a refresh and asserts no .tmp.* or _tmp_replace_* table remains afterwards. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@al13n321 Done, both in 4cecc62.
Verified end-to-end on a debug build by planting a large leftover
Extended Session id: cron:clickhouse-worker-slot-4:20260629-221300 |
Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-4:20260629-221300 |
…e, just bypass drop size limits The previous commit replaced the atomic CREATE OR REPLACE TABLE of the temp inner table .tmp.inner_id.<view_uuid> with a non-atomic DROP TABLE IF EXISTS (sync=false) followed by a plain CREATE TABLE. On a Replicated database that splits one atomic replicated-DDL transaction into two separate, un-waited DDL-log entries on a fixed table name, which opened a race where a replica observes an empty inner table after a refresh. That surfaced as test_refreshable_mv_in_replicated_db failing at the "select * from re.a" assertion (empty result), a signature that appeared for the first time only on that commit. Restore the atomic CREATE OR REPLACE TABLE and instead fix the original ClickHouse#104900 leak at its real cause: doCreateOrReplaceTable renames an existing leftover to _tmp_replace_<random> and drops it through a context copied from the create context, and that drop honors max_table_size_to_drop. The leak happened only because the refresh context did not bypass the size limit. Run the CREATE OR REPLACE on a context with max_table_size_to_drop=0 and max_partition_size_to_drop=0 so the internal drop of the leftover is size-bypassed, fixing the leak while keeping the create/exchange/drop in a single atomic replicated-DDL transaction. The 04247 regression test still asserts no .tmp.* and no _tmp_replace_* tables remain after a refresh that finds a large leftover; both scenarios pass. Closes: ClickHouse#104900 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@al13n321 The Cause: on a Replicated DB, So I kept Verified both directions locally on a debug build:
|
Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-1:20260630-014500 |
…e limits The previous commit ran the temp inner table CREATE OR REPLACE and the temp table DROP on a Context::createCopy(refresh_context) so it could set max_table_size_to_drop=0 / max_partition_size_to_drop=0. ContextData's copy constructor is hand-written and does not copy parent_table_uuid, ddl_query_cancellation, or ddl_additional_checks_on_enqueue, which RefreshTask sets on refresh_context before prepareRefresh. On a Replicated database the copied context therefore enqueued the refresh DDL without the parent MV UUID, the Keeper root-version check, and the DDL cancellation token, so a lagged replica could replay stale .tmp.inner_id.<uuid> entries instead of skipping them. Set the two drop-size settings on refresh_context itself instead of a copy. The settings only gate DROP size checks, refresh_context is created fresh per refresh, and dropTempTable wants the same bypass anyway, so mutating it in place is safe and matches exchangeTargetTable which already runs its interpreter on refresh_context directly. doCreateOrReplaceTable's internal drop of the renamed _tmp_replace_* leftover inherits the bypass via its create context, so the issue ClickHouse#104900 leak fix still holds. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Confirmed and fixed in
Took option (1): set Verified: |
Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-2:20260630-035700 |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 9/10 (90.00%) · Uncovered code |
CI finish ledger — 76673ccEvery failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only The bot DDL-metadata fix (commit 76673cc) is clean: Session id: cron:our-pr-ci-monitor:20260630-090000 |

Motivation
Issue #104900: a Refreshable Materialized View with
REFRESH ...running ona server with a finite
max_table_size_to_drop(default 50 GB) leaks onetemporary inner table per refresh once the rotated-out target exceeds the
limit. Disk usage grows without bound until the server runs out of space.
Root cause
Each refresh of a
MATERIALIZED VIEW ... REFRESH ...(non-APPEND) creates afresh inner table and atomically swaps it with the existing target via
exchangeTargetTable. The previous target then carries the.tmp.inner_id.<uuid>name and is dropped bydropTempTable(
src/Storages/StorageMaterializedView.cpp). That drop goes through theregular
InterpreterDropQuerypath with the refresh context, soMergeTreeData::checkTableCanBeDroppedenforcesmax_table_size_to_drop/max_partition_size_to_dropagainst therotated-out table.
When the table exceeds the limit, the drop throws
TABLE_SIZE_EXCEEDS_MAX_DROP_SIZE_LIMIT.dropTempTablecatches theexception and logs it, but the refresh still reports success — and the
temporary inner is left behind. The next refresh creates yet another fresh
inner and tries to drop the previous one, which fails the same way. The
view's data directory grows without bound.
Reproduction stack (from the issue):
Approach
The size safety nets are user-protection knobs — they exist so that a user
running an interactive
DROP TABLEon a huge MergeTree gets a chance toreconsider. The refresh task's drop is internal cleanup: the temporary table
was created by the refresh task itself moments earlier, the user is not in
control of its size, and refusing the drop just leaks disk space without
giving the user any actionable choice. The right fix is to bypass the
safety check for this one internal drop site.
dropTempTablenow copies the refresh context withContext::createCopy, pinsmax_table_size_to_drop = 0andmax_partition_size_to_drop = 0on the copy (bothchanged=trueso theyoverride the server setting via the path in
MergeTreeData::checkTableCanBeDropped), and uses the copy for theInterpreterDropQuery. Other operations in the original refresh context —logging, dependency teardown, anything else that might run alongside —
keep their original settings.
The fix is localized to
dropTempTable; the only other placemax_table_size_to_dropis checked for a refreshable MV isdropInnerTableIfAny, called when the user explicitlyDROPs the view,which correctly continues to honour the user's policy.
Regression test
tests/queries/0_stateless/04247_104900_refreshable_mv_tmp_table_leak.shThe test creates a 1000-row MergeTree source and a Refreshable MV whose
SELECTcarriesSETTINGS max_table_size_to_drop = 1. The setting ispropagated to the refresh context by
applySettingsFromQueryinprepareRefresh, so without the fix the second refresh's rotated-out targetfails to drop (size ≫ 1 byte). The test then asserts that no
.tmp.inner_id.*tables remain after two successive refreshes.Local verification on this branch:
--no-random-settings --no-random-merge-tree-settings --test-runs 20: 20/20 pass with fix.--test-runs 20: 20/20 pass with fix.src/Storages/StorageMaterializedView.cppand rebuilding: test fails (1leaked tmp inner instead of0).Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix a leak in Refreshable Materialized Views (
MATERIALIZED VIEW ... REFRESH ...) where the temporary inner table that the refresh task rotates out viaEXCHANGE TABLEScould not be dropped if it was larger than the server'smax_table_size_to_drop. Every subsequent refresh would create yet another temporary inner table while still failing to clean up the previous ones, growing the view's data directory until the disk was exhausted. The refresh task's internal drop now bypassesmax_table_size_to_dropandmax_partition_size_to_drop; those safety nets still apply to user-issuedDROP TABLEof the view itself. Closes #104900.Documentation entry for user-facing changes