Fix Refreshable MV temp inner table leak when max_table_size_to_drop is small by groeneai · Pull Request #105106 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix Refreshable MV temp inner table leak when max_table_size_to_drop is small#105106

Open
groeneai wants to merge 9 commits into
ClickHouse:masterfrom
groeneai:fix-104900-refreshable-mv-tmp-table-leak
Open

Fix Refreshable MV temp inner table leak when max_table_size_to_drop is small#105106
groeneai wants to merge 9 commits into
ClickHouse:masterfrom
groeneai:fix-104900-refreshable-mv-tmp-table-leak

Conversation

@groeneai

Copy link
Copy Markdown
Contributor

Motivation

Issue #104900: a Refreshable Materialized View with REFRESH ... running on
a server with a finite max_table_size_to_drop (default 50 GB) leaks one
temporary inner table per refresh once the rotated-out target exceeds the
limit. Disk usage grows without bound until the server runs out of space.

Root cause

Each refresh of a MATERIALIZED VIEW ... REFRESH ... (non-APPEND) creates a
fresh inner table and atomically swaps it with the existing target via
exchangeTargetTable. The previous target then carries the
.tmp.inner_id.<uuid> name and is dropped by dropTempTable
(src/Storages/StorageMaterializedView.cpp). That drop goes through the
regular InterpreterDropQuery path with the refresh context, so
MergeTreeData::checkTableCanBeDropped enforces
max_table_size_to_drop / max_partition_size_to_drop against the
rotated-out table.

When the table exceeds the limit, the drop throws
TABLE_SIZE_EXCEEDS_MAX_DROP_SIZE_LIMIT. dropTempTable catches the
exception and logs it, but the refresh still reports success — and the
temporary inner is left behind. The next refresh creates yet another fresh
inner and tries to drop the previous one, which fails the same way. The
view's data directory grows without bound.

Reproduction stack (from the issue):

4. Context::checkCanBeDropped(...)
5. MergeTreeData::checkTableCanBeDropped(...)
6. InterpreterDropQuery::executeToTableImpl(...)
7. InterpreterDropQuery::executeSingleDropQuery(...)
8. InterpreterDropQuery::execute()
9. StorageMaterializedView::dropTempTable(StorageID, Context, String&)
10. RefreshTask::refreshTask()

Approach

The size safety nets are user-protection knobs — they exist so that a user
running an interactive DROP TABLE on a huge MergeTree gets a chance to
reconsider. The refresh task's drop is internal cleanup: the temporary table
was created by the refresh task itself moments earlier, the user is not in
control of its size, and refusing the drop just leaks disk space without
giving the user any actionable choice. The right fix is to bypass the
safety check for this one internal drop site.

dropTempTable now copies the refresh context with
Context::createCopy, pins max_table_size_to_drop = 0 and
max_partition_size_to_drop = 0 on the copy (both changed=true so they
override the server setting via the path in
MergeTreeData::checkTableCanBeDropped), and uses the copy for the
InterpreterDropQuery. Other operations in the original refresh context —
logging, dependency teardown, anything else that might run alongside —
keep their original settings.

The fix is localized to dropTempTable; the only other place
max_table_size_to_drop is checked for a refreshable MV is
dropInnerTableIfAny, called when the user explicitly DROPs the view,
which correctly continues to honour the user's policy.

Regression test

tests/queries/0_stateless/04247_104900_refreshable_mv_tmp_table_leak.sh

The test creates a 1000-row MergeTree source and a Refreshable MV whose
SELECT carries SETTINGS max_table_size_to_drop = 1. The setting is
propagated to the refresh context by applySettingsFromQuery in
prepareRefresh, so without the fix the second refresh's rotated-out target
fails to drop (size ≫ 1 byte). The test then asserts that no
.tmp.inner_id.* tables remain after two successive refreshes.

Local verification on this branch:

  • --no-random-settings --no-random-merge-tree-settings --test-runs 20: 20/20 pass with fix.
  • Default random settings, --test-runs 20: 20/20 pass with fix.
  • Reverting only src/Storages/StorageMaterializedView.cpp and rebuilding: test fails (1 leaked tmp inner instead of 0).

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a leak in Refreshable Materialized Views (MATERIALIZED VIEW ... REFRESH ...) where the temporary inner table that the refresh task rotates out via EXCHANGE TABLES could not be dropped if it was larger than the server's max_table_size_to_drop. Every subsequent refresh would create yet another temporary inner table while still failing to clean up the previous ones, growing the view's data directory until the disk was exhausted. The refresh task's internal drop now bypasses max_table_size_to_drop and max_partition_size_to_drop; those safety nets still apply to user-issued DROP TABLE of the view itself. Closes #104900.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

…is small

When the refresh task of a `MATERIALIZED VIEW ... REFRESH ...` rotated its
target table out via `exchangeTargetTable`, the previous target became a
temporary inner table named `.tmp.inner_id.<uuid>` that the same refresh task
then tried to drop in `StorageMaterializedView::dropTempTable`. The drop went
through the regular `InterpreterDropQuery` path with the refresh context, so
`MergeTreeData::checkTableCanBeDropped` honoured the server-wide
`max_table_size_to_drop` / `max_partition_size_to_drop` safety nets.

When the rotated-out table was larger than those limits, the drop threw
`TABLE_SIZE_EXCEEDS_MAX_DROP_SIZE_LIMIT`, the temporary inner was leaked, and
every subsequent refresh repeated the pattern — creating yet another fresh
inner while still failing to drop the accumulated ones. The view's data
directory grew until the disk was exhausted (issue ClickHouse#104900).

Those safety nets exist to protect users from accidentally `DROP`ping large
tables; they should not apply to the refresh task's own cleanup, where the
temporary table was created by the refresh itself and the user is not in
control of its size. Use a copy of the refresh context with both settings
pinned to 0 for the internal drop only — other operations in the refresh
context keep their original settings.

Adds a regression test that uses `SETTINGS max_table_size_to_drop = 1` inside
the MV's SELECT (propagated to the refresh context by `applySettingsFromQuery`
in `prepareRefresh`) and verifies no `.tmp.inner_id.*` tables remain after
two successive refreshes.

Closes ClickHouse#104900
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @al13n321 @tuanpach @PedroTadim — could you review this? It fixes the Refreshable MV temp-table leak you flagged in #104900. The refresh task's own dropTempTable was honouring max_table_size_to_drop against the rotated-out target, leaking one .tmp.inner_id.<uuid> per refresh once the target exceeded the limit; the fix bypasses that safety check for the refresh-internal drop only.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 17, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [76673cc]

Summary:

job_name test_name status info comment
Stress test (arm_tsan) FAIL
Hung check failed, possible deadlock found FAIL cidb, issue
Stress test (amd_tsan) ERROR

AI Review

Summary

This PR fixes the refreshable materialized view cleanup path for max_table_size_to_drop / max_partition_size_to_drop: internal temp-table cleanup now bypasses those user-protection limits for refresh-owned tables while keeping the replicated DDL metadata on the original refresh context. I found no remaining correctness issues in the current diff. The prior clickhouse-gh thread about Context::createCopy losing refresh DDL metadata is addressed by commit 76673cc1ad25f88dc765a61e6a12e9249afe6b83 and has been resolved.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 17, 2026
@tuanpach

Copy link
Copy Markdown
Member

@groeneai Some CI jobs failed. Please check them.

@groeneai

Copy link
Copy Markdown
Contributor Author

Hi @tuanpach — I checked all 6 failing CI jobs. They are not caused by this PR. Detailed analysis below.

Summary of failed jobs

All 6 failures come from the same 2 tests:

  • test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db (5 of 6 jobs)
  • test_refreshable_mat_view_replicated/test.py::test_long_query_cancel (2 of 6 jobs)
Failing check Failing test
Integration tests (amd_llvm_coverage, 3/5) test_long_query_cancel
Integration tests (amd_llvm_coverage, 5/5) test_refreshable_mv_in_replicated_db
Integration tests (amd_msan, 5/6) test_refreshable_mv_in_replicated_db
Integration tests (amd_tsan, 5/6) test_refreshable_mv_in_replicated_db
Integration tests (arm_binary, distributed plan, 2/4) test_long_query_cancel
Integration tests (arm_binary, distributed plan, 3/4) test_refreshable_mv_in_replicated_db

Failure signature

Both tests fail in the same way: a user-issued DROP TABLE times out waiting for Replicated DDL coordination across nodes, never in any code touched by this PR.

test_refreshable_mv_in_replicated_db (line 208):

drop table re.f

Error: ReplicatedDatabase DDL task /test/re_X/log/query-NNN is not finished on 1 of 2 hosts (1 of them are currently executing the task, 0 are inactive). Was waiting for 180.6 seconds, which is longer than distributed_ddl_task_timeout. (TIMEOUT_EXCEEDED)

test_long_query_cancel (line 223, in fn_setup_tables):

DROP TABLE IF EXISTS test_rmv ON CLUSTER default

Error: Timeout exceeded while receiving data from server. Waited for 300 seconds, timeout is 300 seconds.

Both errors are Keeper-coordinated DDL timeouts under slow CI runners (coverage / msan / tsan builds are 3–20× slower than normal). Neither error stack passes through StorageMaterializedView::dropTempTable, which is what this PR touches.

Why this PR cannot be the cause

This PR modifies a single internal helper:

void StorageMaterializedView::dropTempTable(StorageID table_id, ContextMutablePtr refresh_context, String & out_exception)

dropTempTable is called from exactly two sites, both in RefreshTask::executeRefreshUnlocked():

$ grep -rn "dropTempTable" src/
src/Storages/MaterializedView/RefreshTask.cpp:995:  view->dropTempTable(table_to_drop.value(), refresh_context, discard_error_message);
src/Storages/MaterializedView/RefreshTask.cpp:1020: view->dropTempTable(table_to_drop.value(), refresh_context, out_error_message);
src/Storages/StorageMaterializedView.h:150:        void dropTempTable(...)
src/Storages/StorageMaterializedView.cpp:675:      void StorageMaterializedView::dropTempTable(...)

It is never reachable from the user-facing DROP TABLE path. The user DROPs in the failing tests do not enter this function at all — they go through the standard InterpreterDropQueryIDatabase::dropTableStorageMaterializedView::drop chain, which is unchanged by this PR.

CIDB confirms the tests are chronically flaky on these exact jobs

Master is clean — these tests have 0 failures on master in the past 30 days:

SELECT toStartOfDay(check_start_time) as day,
       countIf(pull_request_number = 0) AS master_failures,
       countIf(pull_request_number > 0) AS pr_failures
FROM default.checks
WHERE test_name IN (
    'test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db',
    'test_refreshable_mat_view_replicated/test.py::test_long_query_cancel'
  )
  AND test_status IN ('FAIL', 'ERROR')
  AND check_start_time > now() - INTERVAL 30 DAY
GROUP BY day ORDER BY day DESC
day                   master_failures  pr_failures
2026-05-18 00:00:00   0                6   (← this PR)
2026-05-16 00:00:00   0                14  (PR #101757)
2026-05-05 00:00:00   0                2   (PR #104051)
2026-05-04 00:00:00   0                1   (PR #104051)
2026-05-01 00:00:00   0                1   (PR #103849)

The smoking gun: PR #101757 (your "KeeperDispatcher overhaul" — an entirely unrelated PR touching src/Coordination/ and contrib/NuRaft, with zero changes to refreshable MV code) hit the EXACT SAME 6 jobs with the EXACT SAME 2 tests on 2026-05-16:

Job PR #105106 (this PR) PR #101757 (Keeper overhaul)
Integration tests (amd_tsan, 5/6) 1 fail 2 fails
Integration tests (amd_msan, 5/6) 1 fail 2 fails
Integration tests (amd_llvm_coverage, 5/5) 1 fail 2 fails
Integration tests (amd_llvm_coverage, 3/5) 1 fail 2 fails
Integration tests (arm_binary, distributed plan, 2/4) 1 fail 2 fails
Integration tests (arm_binary, distributed plan, 3/4) 1 fail 2 fails

That identical failure footprint across two unrelated PRs, combined with zero master failures, makes this a chronic environmental flake — the tests' Replicated DDL coordination times out under slow sanitizer/coverage runners, regardless of what the PR changes.

Action requested

Could you re-run the failed CI jobs? I expect them to pass on a re-run. If they recur, I'm happy to dig into the underlying flaky timeout (the 180s distributed_ddl_task_timeout is just too tight on coverage builds for these MV-stress tests), but that would be a separate fix, unrelated to issue #104900.

Thanks!

@groeneai (session cron:clickhouse-ci-task-worker:20260518-033000)

@tuanpach

Copy link
Copy Markdown
Member

@groeneai

The CI was rerun. There are still some failures. Check them.

Pick up fix ClickHouse#105194 (02180_group_by_lowcardinality stabilization) and other
master fixes to clear chronic flakes on PR ClickHouse#105106 CI.
@groeneai

Copy link
Copy Markdown
Contributor Author

Hi @tuanpach, I rechecked the 9 failures on commit 30f1835b. None are caused by this PR. Just pushed a master merge to clear one of them; the other 8 are the same chronic Replicated DDL timeouts I reported earlier.

Failure breakdown

# Failing test Type Status
1 Tests/02180_group_by_lowcardinality (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel) Known flaky, already fixed on master by #105194 Cleared by master merge aaf3da69 just pushed
2-9 test_refreshable_mv_in_replicated_db (x6) / test_long_query_cancel (x2) Chronic ReplicatedDatabase DDL timeout flake Unchanged from previous reply

(1) 02180_group_by_lowcardinality — already fixed, just needed rebase

(2) test_refreshable_mv_in_replicated_db and test_long_query_cancel — chronic, unrelated

Both tests fail with the same root cause as the previous reply: ReplicatedDatabase DDL task ... is not finished on 1 of 2 hosts ... Was waiting for 180 seconds or Timeout exceeded while receiving data from server. Waited for 300 seconds on DROP TABLE ... ON CLUSTER default. Neither stack passes through StorageMaterializedView::dropTempTable, the only path this PR touches.

CIDB cross-PR distribution (last 30 days, both tests):

test_refreshable_mv_in_replicated_db        test_long_query_cancel
PR 105106 (this PR): 11 hits across 6 checks  PR 105106: 4 hits across 2 checks
PR 101757 (al13n321 KeeperDispatcher): 10/5  PR 101757: 4 hits across 2 checks
PR 104051: 2/2                                PR 104051: 1/1
PR 103849: 1/1

The smoking gun: @al13n321's PR #101757 ("KeeperDispatcher overhaul", which only touches src/Coordination/ and contrib/NuRaft, zero refreshable MV code) hit the SAME 2 tests in the SAME pattern at the SAME rate. This confirms it is environmental flakiness in Replicated DDL coordination under slow sanitizer/coverage runners, not a refreshable MV bug.

What I just did

Pushed master merge aaf3da69 to this branch. New CI run will start automatically. Expected outcome:

  • 02180 will pass (fix is now in branch)
  • The 8 Replicated DDL timeout slots may or may not flake again (environmental). If they fail again on the same 2 tests, that confirms the chronic-flake attribution and the PR is mergeable.

Could you re-trigger CI / wait for the new commit's run, and if those 2 tests still flake, mark the PR mergeable? Happy to file a separate issue for the chronic Replicated DDL timeout pattern if you think that would help (it would be a sibling to existing flaky-test tracking).

@groeneai

groeneai commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

CI ledger — covered SHA: a9167c7289f6666b149e1eb0c1926b921f76d846

CI fully finished (Mergeable Check completed; 0 pending checks).

Check Test Disposition
Stateless tests (amd_tsan, parallel, 1/2) 00096_obfuscator_save_load, 00175_obfuscator_schema_inference flaky chronic trunk — tasks 2026-06-02-ci-00096obfuscatorsaveload-chronic-f / 2026-06-02-ci-00175obfuscatorschemainference-ab
Integration tests (amd_msan/amd_tsan/amd_asan_ubsan/amd_llvm_coverage/arm_binary distributed plan) (multiple variants) test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db chronic PR-only flaky on Replicated DB integration tests; 0 master / 7 distinct PRs in 60d (incl. unrelated branches like throw-on-unknown-config-option, shard-profile-events-per-cpu). Tracked in task 2026-06-04-ci-testrefreshablemvinreplicateddb. NOT PR-caused.
Integration tests (amd_llvm_coverage, 3/5), Integration tests (arm_binary, distributed plan, 2/4) test_refreshable_mat_view_replicated/test.py::test_long_query_cancel chronic PR-only flaky — same Replicated-DB-DDL-race family as the test above.
Integration tests (amd_msan, 1/6) test_refreshable_mat_view_replicated/test.py::test_circular_dependencies_survive_restart tracked in 2026-06-04-ci-p2-testcirculardependenciessurviv (chronic).
Stateless tests (amd_tsan, s3 storage, parallel, 2/2) 02813_starting_in_text_log chronic flaky text_log timing test on s3 storage.

No PR-caused failures. The fix is small (StorageMaterializedView::dropTempTable bypasses max_table_size_to_drop for refresh-managed temp tables) and the new regression test 04247_104900_refreshable_mv_tmp_table_leak.sh exercises it cleanly. Awaiting human review (@ PedroTadim).

Session: cron:our-pr-ci-monitor:20260604-213000

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on this head (7d9308a4de9, master-merge). The 8 red checks are the same chronic environmental failures already classified on the prior commit (#105106 (comment)); none are caused by this PR (0 master hits in 30d):

@clickhouse-gh

clickhouse-gh Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Dear @tuanpach, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@tuanpach tuanpach self-assigned this Jun 23, 2026
@tuanpach

Copy link
Copy Markdown
Member

@groeneai

Check the CI failure. Any PRs fixing them?

@groeneai

Copy link
Copy Markdown
Contributor Author

@tuanpach Re-checked CI on the current HEAD (7d9308a). Three reds, all chronic, none caused by this PR. It only touches StorageMaterializedView::dropTempTable, which is internal to RefreshTask and is never reached by a user DROP TABLE.

  1. test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db (5 integration jobs). Fails at drop table re.<name> with rc 159, Timeout exceeded ... 300 seconds: the Replicated-DB DDL task does not finish in time on the slow sanitizer/coverage runners. Last 30d: 0 master failures, 27 failures across 6 PRs (5 unrelated). Our merged Fix flaky test_refreshable_mv_in_replicated_db #107132 fixed a different failure mode of this test (the Refreshable view ... doesn't exist race after SYSTEM SYNC DATABASE REPLICA) and is already in this branch. No PR currently addresses this DROP-timeout mode, which is generic Replicated-DDL-coordination slowness rather than a test bug.

  2. test_refreshable_mat_view_replicated/test.py::test_long_query_cancel (2 integration jobs). Same signature: DROP TABLE ... ON CLUSTER default times out at 300s. Last 30d: 1 master failure, 8 failures across 5 PRs. No fix PR exists.

  3. Stress test (arm_asan_ubsan), Hung check failed, possible deadlock found. Trunk-wide chronic hung-check, not specific to this PR: last 30d = 290 master failures across 350 PRs. No single fix PR; it is a class of trunk hangs under stress.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105106&sha=7d9308a4de9f45929efeb6ff2702ea582bc80ae3&name_0=PR&name_1=Integration%20tests%20%28amd_llvm_coverage%2C%202%2F5%29

The PR is otherwise green and you approved it; these reds are known-flaky CI gating only.

Comment thread src/Storages/StorageMaterializedView.cpp Outdated
…CREATE OR REPLACE

prepareRefresh created the .tmp.inner_id.<uuid> table with CREATE OR REPLACE.
When a leftover .tmp.inner_id.<uuid> from a previous failed refresh already
existed, CREATE OR REPLACE renamed it to a _tmp_replace_<random> name and then
dropped it through the normal path that honors max_table_size_to_drop. A leftover
larger than that limit failed to drop and was leaked under the _tmp_replace name,
and every later refresh repeated the leak until the disk filled (issue ClickHouse#104900).

Drop the leftover explicitly with DROP TABLE IF EXISTS on a context that bypasses
max_table_size_to_drop / max_partition_size_to_drop (same as dropTempTable), then
do a plain CREATE TABLE. The async drop frees the table name synchronously on
Atomic and Replicated databases, so the following CREATE does not collide.

Extends the 04247 regression test with a scenario that plants a large leftover
.tmp.inner_id.<uuid> before a refresh and asserts no .tmp.* or _tmp_replace_*
table remains afterwards.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@al13n321 Done, both in 4cecc62.

  • L709 comment: kept your wording (you already committed it in 518d023).
  • RMV change: prepareRefresh now does DROP TABLE IF EXISTS .tmp.inner_id.<uuid> (on a context that pins max_table_size_to_drop/max_partition_size_to_drop to 0, same as dropTempTable) then a plain CREATE TABLE, instead of CREATE OR REPLACE TABLE. The async drop frees the name synchronously on Atomic/Replicated, so the following CREATE does not collide.

Verified end-to-end on a debug build by planting a large leftover .tmp.inner_id.<view_uuid> and refreshing with max_table_size_to_drop = 1:

  • before (CREATE OR REPLACE): refresh fails Code 359 and leaks _tmp_replace_<random> (prepareRefresh -> doCreateOrReplaceTable -> InterpreterDropQuery -> checkTableCanBeDropped).
  • after (this change): refresh succeeds, no .tmp.* or _tmp_replace_* left behind.

Extended 04247_104900_refreshable_mv_tmp_table_leak with that scenario (passes 10/10 with randomization).

Session id: cron:clickhouse-worker-slot-4:20260629-221300

@groeneai

Copy link
Copy Markdown
Contributor Author
Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. Plant a large .tmp.inner_id.<view_uuid> (1000 rows), set max_table_size_to_drop = 1, SYSTEM REFRESH VIEW. On the old code the refresh fails every time with Code 359 and leaks _tmp_replace_<random>.
b Root cause explained? prepareRefresh used CREATE OR REPLACE TABLE. When the target .tmp.inner_id.<uuid> already exists, doCreateOrReplaceTable renames it to _tmp_replace_<random> then drops it via a context that honors max_table_size_to_drop. A leftover larger than the limit fails checkTableCanBeDropped, the rename is not rolled back, and _tmp_replace_<random> is leaked. Every later refresh repeats it.
c Fix matches root cause? Yes. Drop the leftover explicitly with DROP TABLE IF EXISTS on a size-bypassed context (same as dropTempTable), then plain CREATE TABLE. This removes the CREATE OR REPLACE rename-then-size-checked-drop path entirely, so nothing is left under _tmp_replace_*.
d Test intent preserved / new tests added? Existing scenario (rotated-out drop) preserved. Added scenario B to 04247_104900_refreshable_mv_tmp_table_leak: plants a large leftover and asserts 0 .tmp.* and 0 _tmp_replace_* after a refresh.
e Both directions demonstrated? Yes, end-to-end on a debug build. Old binary: refresh fails Code 359, leaks _tmp_replace_* (stack prepareRefresh -> doCreateOrReplaceTable -> InterpreterDropQuery -> checkTableCanBeDropped). New binary: refresh succeeds, 0 leaks. Test passes 10/10 with randomization.
f Fix is general across code paths? Yes. prepareRefresh is the only RMV site using CREATE OR REPLACE for the temp inner table; the post-exchange drop in dropTempTable already bypasses the size limit. Both refresh-owned temp-table paths now bypass it consistently.
g Fix generalizes across inputs? Yes. The leftover name is deterministic (.tmp.inner_id.<view_uuid>) regardless of MV column types/engine; the explicit DROP uses if_exists=true so it is a no-op when no leftover exists (normal case), and size-bypassed when one does. Verified the normal multi-refresh path (no leftover) stays clean.
h Backward compatible? Yes. Internal refresh behavior only; no setting default, on-disk/wire format, or user-visible semantics change. The temp inner table name and the resulting inner table are unchanged.
i Invariants and contracts preserved? Yes. The async DROP (sync=false) frees the table name synchronously on Atomic/Replicated (DatabaseAtomic::dropTableImpl detaches the name + erases the path map entry under lock; sync only gates background data cleanup keyed by UUID), so the subsequent CREATE cannot hit TABLE_ALREADY_EXISTS. Each CREATE uses a fresh UUID, so no store/<uuid>/ collision. Keeps the existing "don't wait for other replicas" semantics.

Session id: cron:clickhouse-worker-slot-4:20260629-221300

…e, just bypass drop size limits

The previous commit replaced the atomic CREATE OR REPLACE TABLE of the temp inner
table .tmp.inner_id.<view_uuid> with a non-atomic DROP TABLE IF EXISTS (sync=false)
followed by a plain CREATE TABLE. On a Replicated database that splits one atomic
replicated-DDL transaction into two separate, un-waited DDL-log entries on a fixed
table name, which opened a race where a replica observes an empty inner table after a
refresh. That surfaced as test_refreshable_mv_in_replicated_db failing at the
"select * from re.a" assertion (empty result), a signature that appeared for the first
time only on that commit.

Restore the atomic CREATE OR REPLACE TABLE and instead fix the original ClickHouse#104900 leak at
its real cause: doCreateOrReplaceTable renames an existing leftover to _tmp_replace_<random>
and drops it through a context copied from the create context, and that drop honors
max_table_size_to_drop. The leak happened only because the refresh context did not bypass
the size limit. Run the CREATE OR REPLACE on a context with max_table_size_to_drop=0 and
max_partition_size_to_drop=0 so the internal drop of the leftover is size-bypassed, fixing
the leak while keeping the create/exchange/drop in a single atomic replicated-DDL transaction.

The 04247 regression test still asserts no .tmp.* and no _tmp_replace_* tables remain after a
refresh that finds a large leftover; both scenarios pass.

Closes: ClickHouse#104900

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@al13n321 The DROP TABLE IF EXISTS + CREATE TABLE change (commit 4cecc62) caused a refresh-correctness regression in test_refreshable_mv_in_replicated_db: on a Replicated database it failed at assert select * from re.a == "0\n10\n" with an empty inner table. That assertion has never failed in 90 days of CI for this PR except on that one commit.

Cause: on a Replicated DB, CREATE OR REPLACE TABLE is a single atomic replicated-DDL transaction (the create of the temp table, the EXCHANGE rename, and the drop of the old one are committed in one ZK multi-op via the isCreateOrReplaceQuery path). Splitting it into a separate DROP TABLE IF EXISTS (sync=false) and a plain CREATE TABLE makes two independent, un-waited DDL-log entries on the fixed name .tmp.inner_id.<view_uuid>, which opens a window where a replica observes/exchanges an empty inner table.

So I kept CREATE OR REPLACE TABLE but fixed the _tmp_replace_* leak at its real cause instead: doCreateOrReplaceTable drops the renamed leftover through a context copied from the create context, and that drop honors max_table_size_to_drop. The leak happened only because the refresh context did not bypass the size limit. New commit (99b36f3) runs the CREATE OR REPLACE on a context with max_table_size_to_drop = 0 and max_partition_size_to_drop = 0, so the internal drop of the leftover is size-bypassed. This fixes the _tmp_replace_* leak while keeping the operation atomic.

Verified both directions locally on a debug build:

  • Without the size bypass, CREATE OR REPLACE on a 1000-row leftover with max_table_size_to_drop=1 throws code 359 and leaks _tmp_replace_<random>.
  • With the fix, the 04247 regression test (plant a large leftover, refresh, assert 0 .tmp.* and 0 _tmp_replace_*) passes 5/5.
  • test_refreshable_mv_in_replicated_db no longer hits the empty-re.a assertion locally; only the pre-existing Replicated-DDL drop timeout flakes, which is unrelated to this change.

@groeneai

Copy link
Copy Markdown
Contributor Author
Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes for the leak direction: CREATE OR REPLACE TABLE t ... AS SELECT ... numbers(1000) over a 1000-row table with max_table_size_to_drop=1 throws code 359 and leaks _tmp_replace_<random> every time. The empty-re.a regression is a Replicated-DDL race, not deterministic; addressed by mechanism (restoring the atomic single-transaction path that was empirically clean across this PR's full CI history).
b Root cause explained? The prior commit split the atomic CREATE OR REPLACE TABLE .tmp.inner_id.<uuid> into DROP TABLE IF EXISTS (sync=false) + plain CREATE TABLE. On a Replicated DB that is two separate un-waited DDL-log entries on a fixed name instead of one atomic ZK multi-op (isCreateOrReplaceQuery path), so a replica can observe/exchange an empty inner table, failing select * from re.a. The original _tmp_replace_* leak (#104900) is because doCreateOrReplaceTable's internal drop of the renamed leftover uses a context copied from the create context and honors max_table_size_to_drop, which the refresh context did not bypass.
c Fix matches root cause? Yes. Restore atomic CREATE OR REPLACE (fixes the empty-table race) and set max_table_size_to_drop=0 + max_partition_size_to_drop=0 on the create context so the internal leftover-drop is size-bypassed (fixes the leak). Both causes addressed directly, no band-aid.
d Test intent preserved / new tests added? Preserved. 04247 still plants a large leftover .tmp.inner_id.<view_uuid> and asserts 0 .tmp.* and 0 _tmp_replace_* after a refresh, plus the rotated-target Scenario A. Only the Scenario B comment was updated to describe the new mechanism; assertions unchanged.
e Both directions demonstrated? Yes. Without the size bypass, plain CREATE OR REPLACE leaks _tmp_replace_* (code 359). With the fix, 04247 passes 5/5; test_refreshable_mv_in_replicated_db no longer hits the empty-re.a assertion.
f Fix is general across code paths? Yes. prepareRefresh is the only refresh table-create site. The size bypass mirrors the existing dropTempTable in the same file, so the create and drop sides are now consistent. No sibling path leaks.
g Fix generalizes across inputs? The setting bypass is type-agnostic (it gates the drop-size check, independent of column types). Verified with MergeTree inner tables; the inner-table engine comes from the view definition and the path does not branch on datatype/wrapper.
h Backward compatible? Yes. No setting default, on-disk/wire format, or replication metadata change. The size-bypass applies only to the RMV-internal temp-table create context, never to user-issued DROP/CREATE.
i Invariants and contracts preserved? Yes. Restoring CREATE OR REPLACE restores the single-atomic-transaction invariant the Replicated-DDL log relies on (which the split had broken). The fresh-UUID temp table and EXCHANGE-by-name semantics are unchanged. Context::createCopy preserves the ZK metadata transaction, same pattern as the existing dropTempTable.

Session id: cron:clickhouse-worker-slot-1:20260630-014500

Comment thread src/Storages/StorageMaterializedView.cpp Outdated
…e limits

The previous commit ran the temp inner table CREATE OR REPLACE and the
temp table DROP on a Context::createCopy(refresh_context) so it could set
max_table_size_to_drop=0 / max_partition_size_to_drop=0. ContextData's
copy constructor is hand-written and does not copy parent_table_uuid,
ddl_query_cancellation, or ddl_additional_checks_on_enqueue, which
RefreshTask sets on refresh_context before prepareRefresh. On a Replicated
database the copied context therefore enqueued the refresh DDL without the
parent MV UUID, the Keeper root-version check, and the DDL cancellation
token, so a lagged replica could replay stale .tmp.inner_id.<uuid> entries
instead of skipping them.

Set the two drop-size settings on refresh_context itself instead of a copy.
The settings only gate DROP size checks, refresh_context is created fresh
per refresh, and dropTempTable wants the same bypass anyway, so mutating it
in place is safe and matches exchangeTargetTable which already runs its
interpreter on refresh_context directly. doCreateOrReplaceTable's internal
drop of the renamed _tmp_replace_* leftover inherits the bypass via its
create context, so the issue ClickHouse#104900 leak fix still holds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Confirmed and fixed in 76673cc.

Context::createCopy runs new Context(*other), and ContextData's copy constructor (Context.cpp) is hand-written: it lists each copied member explicitly and copies metadata_transaction but omits parent_table_uuid, ddl_query_cancellation, and ddl_additional_checks_on_enqueue. So the copied context lost the refresh DDL metadata that RefreshTask sets on refresh_context (RefreshTask.cpp), and on a Replicated database the CREATE OR REPLACE / DROP enqueued without the parent UUID, the Keeper root-version check, and the cancellation token. Same problem in both prepareRefresh and dropTempTable.

Took option (1): set max_table_size_to_drop / max_partition_size_to_drop on refresh_context itself instead of a copy. The two settings only gate the DROP size check, refresh_context is created fresh per refresh, and dropTempTable wants the same bypass anyway, so mutating it in place is safe. This also matches exchangeTargetTable, which already runs its interpreter on refresh_context directly. doCreateOrReplaceTable's internal drop of the renamed _tmp_replace_* leftover inherits the bypass via its create context, so the #104900 leak fix still holds.

Verified: 04247_104900_refreshable_mv_tmp_table_leak 11/11, and test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db 3/3 (the empty-re.a line-97 path the previous commit regressed no longer reproduces).

@groeneai

Copy link
Copy Markdown
Contributor Author
Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes for the defect: ContextData's hand-written copy ctor provably omits the three DDL fields, so createCopy drops them 100% of the time. The runtime symptom (empty re.a) is a Replicated-DDL race surfaced on the prior commit (4cecc62); the metadata-loss root defect is deterministic.
b Root cause explained? createCopy -> new Context(*other) -> hand-written ContextData(const ContextData&) copies an explicit member list that includes metadata_transaction but excludes parent_table_uuid / ddl_query_cancellation / ddl_additional_checks_on_enqueue. RefreshTask sets those on refresh_context; the copy default-inits them; DatabaseReplicatedWorker / DDLTask then enqueue the refresh DDL without the parent-UUID skip, Keeper version check, and cancellation token.
c Fix matches root cause? Yes. Set the two drop-size settings on refresh_context itself (no copy) in both prepareRefresh and dropTempTable, so the DDL metadata is carried into the create/drop interpreters.
d Test intent preserved / new tests added? Yes. 04247 still asserts 0 .tmp.* and 0 _tmp_replace_* after refreshes (Scenario A+B); test_refreshable_mv_in_replicated_db still asserts the replicated basic-refresh result. No assertion weakened.
e Both directions demonstrated? The prior commit (4cecc62, createCopy) regressed test_refreshable_mv_in_replicated_db line-97 (empty re.a, CIDB: first-ever hit, only on 4cecc62); with this fix that test passes 3/3 and 04247 passes 11/11.
f Fix is general across code paths? Both lossy createCopy sites in the refresh path (prepareRefresh create, dropTempTable drop) are fixed. exchangeTargetTable already used refresh_context directly. No other refresh DDL goes through a copied context.
g Fix generalizes across inputs? The metadata-loss is input-independent (any refresh on a Replicated DB). Covers append and non-append refresh, leftover-present and leftover-absent, and the root_znode_version == -1 (no enqueue check) branch.
h Backward compatible? Yes. No setting default, on-disk/wire/replication format, or new validation changed. Restores the original pre-PR behavior (interpreter on refresh_context) plus an in-context size bypass.
i Invariants and contracts preserved? Yes. refresh_context is created fresh per refresh, so the in-place setting change has per-refresh scope; INSERT and RENAME do not read the drop-size settings, so the create/exchange are unaffected. The Replicated-DDL contract (parent-UUID skip / Keeper version check / cancellation) is now upheld, which the copied context broke.

Session id: cron:clickhouse-worker-slot-2:20260630-035700

@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.70% 92.70% +0.00%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 9/10 (90.00%) · Uncovered code

Full report · Diff report

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 76673cc

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
Stress test (arm_tsan) / Hung check failed, possible deadlock found deadlock (chronic shutdown hung-check family, 0 PR-caused; refreshable_mv integration tests passed 0 FAIL on this head, so the prior line-97 empty-result regression is resolved) #105905 (ours, open)
Mergeable Check aggregator reflecting the above resolves with the hung-check owner above
CH Inc sync CH Inc sync (private, not actionable)

The bot DDL-metadata fix (commit 76673cc) is clean: test_refreshable_mv_in_replicated_db and the sibling refreshable-MV integration tests passed (0 FAIL / 240180 OK across 144 jobs on this head), confirming the line-97 empty-re.a regression and the createCopy-drops-DDL-metadata issue are both resolved.

Session id: cron:our-pr-ci-monitor:20260630-090000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

When a Refreshable MaterializedView fails to delete tmp tables, it keeps creating more tmp tables until the server runs out of disk

4 participants