Clear async load tasks on detach to fix self-suggesting table hint by groeneai · Pull Request #106238 · ClickHouse/ClickHouse · GitHub
Skip to content

Clear async load tasks on detach to fix self-suggesting table hint#106238

Merged
alexey-milovidov merged 5 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-91777-startup-table-leak
Jun 29, 2026
Merged

Clear async load tasks on detach to fix self-suggesting table hint#106238
alexey-milovidov merged 5 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-91777-startup-table-leak

Conversation

@groeneai

@groeneai groeneai commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

After a server restart (or DETACH DATABASE / ATTACH DATABASE), a second DROP TABLE on an already-dropped table produces a confusing self-suggesting hint:

Table default.t0 does not exist. Maybe you meant default.t0?. (UNKNOWN_TABLE)

The async-load pipeline populates DatabaseOrdinary::startup_table (and load_table), but DatabaseWithOwnTablesBase::detachTableUnlocked only erases from tables. The stale name survives in startup_table, DatabaseOrdinary::getAllTableNames keeps reporting it, and TableNameHints matches the just-dropped name against itself.

Make detachTableUnlocked virtual and override it in DatabaseOrdinary so that it also erases the entry from startup_table and load_table. The override fires through every detach/drop path that calls into the base class implementation, so Ordinary, Atomic, and Replicated are all fixed (Atomic and Replicated inherit from DatabaseOrdinary).

A stateless regression test reproduces the bug via DETACH DATABASE + ATTACH DATABASE, which re-runs the same TablesLoader path as a server restart and is enough to populate startup_table without needing the integration-test harness. The test fails on master HEAD without the patch (BAD: stale startup_table entry leaked into hint) and passes with the patch (OK); it was looped 20x locally without flakes. Legitimate hints to existing tables (typo bet for table beta) are preserved.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix confusing "Maybe you meant X?" hint after a server restart (or DETACH/ATTACH DATABASE), where dropping an already-dropped table would suggest the just-dropped name as the alternative. The async-load tasks are now cleared from DatabaseOrdinary::startup_table and load_table on detach.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Closes #91777.

Version info

  • Merged into: 26.7.1.233

@groeneai

groeneai commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

@PedroTadim PedroTadim added the can be tested Allows running workflows for external contributors label Jun 1, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [a8a201d]

Summary:


AI Review

Summary

This PR clears stale async-load bookkeeping from DatabaseOrdinary::startup_table and load_table when a table is detached, and also covers the DatabaseAtomic::renameTable detach lambda that bypasses detachTableUnlocked. That matches the linked issue's root cause: TableNameHints reads getAllTableNames, so removed table names must not remain in the async-load maps after restart or DETACH DATABASE / ATTACH DATABASE. I did not find any remaining correctness issues in the current diff; the earlier bot findings about test fail-fast behavior and the Atomic rename path are addressed, and the latest Praktika report for a8a201d5e51b14e3f6ebbae4f7dd3bd8fbd9712b shows no failed checks.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 1, 2026
@PedroTadim

Copy link
Copy Markdown
Member

@groeneai many tests failed, check.

@groeneai

Copy link
Copy Markdown
Contributor Author

Checked. The failures on the previous commit (9431e00) were all explained by one issue in the regression test, now fixed in 83cc50e. Breakdown:

CI ledger, covered SHA 9431e00:

  • Stateless tests (...) x6 (amd_asan_ubsan distributed, amd_tsan parallel, amd_msan WasmEdge parallel, amd_msan flaky, arm_binary parallel) failing on 04304_91777_drop_after_attach_database_no_self_hint: test bug. The hint check pattern Maybe you meant.*\.t\? also matched a legitimate cross-database hint in parallel mode (Maybe you meant test_<rand>.t?). The source fix works; the test mis-flagged it. Fixed in 83cc50e (exact self-match + fail-fast). Detail on the inline thread.
  • Server died x4 in flaky-check / targeted jobs: infrastructure noise, not this PR. This marker appears 381 times across 63 distinct PRs in the last 14 days. The server log for our run shows only clean-shutdown messages (no crash, no OOM, no sanitizer report).
  • Mergeable Check / CH Inc sync / PR: aggregator jobs reflecting the above, not independent failures.

The test never failed on master or any other PR (only this PR's 2026-06-01 run). New CI is running on 83cc50e.

session: cron:clickhouse-worker-slot-23:20260610-184200

@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate (fix-up commit 83cc50e)

# Question Answer
a Deterministic repro? Yes. Running the test body against a live server: a synthetic self-suggestion string (Maybe you meant ${DB}.t?) is flagged BAD; a deliberately broken setup command (bad table engine) makes the script exit non-zero with no OK (fail-fast); a clean run prints OK 20/20.
b Root cause explained? The 6 stateless failures came from the test, not the source fix. The check used grep -qE 'Maybe you meant.*\.t\?'. In parallel mode the hint system legitimately cross-suggests a concurrent test's table (Maybe you meant test_n3czwljm.t?, confirmed in the arm_binary report). That loose regex matched the cross-DB hint and printed BAD, even though the fix correctly stopped the self-suggestion. CI reruns passed 35/35, consistent with a parallel-collision false positive.
c Fix matches root cause? Yes. The check now matches exactly Maybe you meant ${DB}.t? (self-suggestion to the just-dropped table), so cross-database hints no longer trip it. Separately, set -euo pipefail + cleanup trap close the silent-setup-bypass hole (the earlier bot/reviewer note); the expected second DROP TABLE stays guarded with `
d Test intent preserved / new tests added? Preserved. The test still asserts the exact bug (a self-suggestion to ${DB}.t) and is now stricter: a failed setup can no longer reach a false OK. It only stops flagging the legitimate cross-DB case it was never meant to catch.
e Both directions demonstrated? Yes. Clean run -> OK (20/20). Broken setup -> fail-fast, no false OK. Synthetic self-suggestion -> BAD.
f Fix is general, not a narrow patch? N/A (test-only change; the source fix in DatabaseOrdinary::detachTableUnlocked is correct and unchanged, confirmed by the cross-DB hint pointing away from self).

Session id: cron:clickhouse-worker-slot-23:20260610-184200

Comment thread src/Databases/DatabaseOrdinary.cpp
After server restart (or DETACH/ATTACH DATABASE) the async-load pipeline
populates `DatabaseOrdinary::startup_table` and `load_table`, but
`DatabaseWithOwnTablesBase::detachTableUnlocked` only removes the entry
from `tables`. The stale entry in `startup_table` survives, and
`DatabaseOrdinary::getAllTableNames` reports it as a known name. The
next failed lookup of the just-detached name asks `TableNameHints` for
a suggestion, which then matches the dropped name against itself and
produces the confusing

  Table X.Y does not exist. Maybe you meant X.Y?

Make `detachTableUnlocked` virtual and override it in `DatabaseOrdinary`
to also erase the entries from `startup_table` and `load_table`, via a
shared `eraseAsyncLoadState` helper. The override fires through every
detach/drop path that calls into the base class implementation:
`DatabaseWithOwnTablesBase::detachTable`, `DatabaseAtomic::detachTable`,
`DatabaseAtomic::dropTableImpl`, and `DatabaseOnDisk::dropTable` /
`detachTablePermanently` / `renameTable` (via `detachTable`).

`DatabaseAtomic::renameTable` does not go through `detachTableUnlocked`:
it detaches via a local lambda that erases `tables`/`table_name_to_path`
directly. Call `eraseAsyncLoadState` from that lambda too, otherwise a
`RENAME TABLE` on an async-loaded Atomic/Replicated table leaves the old
name in `startup_table` and a later typo lookup near the old name still
suggests it. Applies to `Ordinary`, `Atomic`, and `Replicated` (all
inherit from `DatabaseOrdinary`).

Closes ClickHouse#91777.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the groeneai/fix-91777-startup-table-leak branch from 83cc50e to 6ea3728 Compare June 11, 2026 00:43
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate

(For the renameTable fix-up + extended regression test pushed in this revision.)

# Question Answer
a Deterministic repro? Yes. On a debug build, CREATE DATABASE db; CREATE TABLE db.alpha (...) ENGINE=MergeTree; DETACH DATABASE db; ATTACH DATABASE db; RENAME TABLE db.alpha TO db.beta; then DROP TABLE db.alphx deterministically prints Maybe you meant db.alpha? on the unfixed build (in an isolated DB so no cross-DB hint competes).
b Root cause explained? DETACH/ATTACH DATABASE re-runs async loading, populating DatabaseOrdinary::startup_table with alpha. DatabaseAtomic::renameTable detaches the source via a local lambda that erases only tables/table_name_to_path, never going through detachTableUnlocked, so startup_table["alpha"] survives. getAllTableNames returns tables ∪ startup_table = {beta, alpha}, so a typo one edit from alpha is answered with the nonexistent db.alpha (issue #91777).
c Fix matches root cause? Yes. Extracted the async-load-state erase into DatabaseOrdinary::eraseAsyncLoadState(name) and call it from both detachTableUnlocked and the Atomic rename detach lambda, so the rename path clears startup_table/load_table for the old name at the exact point it removes it from tables. No symptom guard.
d Test intent preserved / new tests added? Yes. Added scenario 2 to 04304_91777_drop_after_attach_database_no_self_hint.sh (RENAME path) alongside the existing DROP scenario; reference updated to OK\nOK. The hint check matches the exact Maybe you meant <db>.<name>? string so a legitimate cross-DB hint from a concurrent parallel test is not flagged.
e Both directions demonstrated? Yes. Two debug builds from the same tree differing only by this change: scenario 2 prints BAD: stale startup_table entry leaked into hint after RENAME on the unfixed build and OK on the fixed build. The full test passes 25/25 on the fixed build.
f Fix is general, not a narrow patch? Yes. Checked sibling paths: the Ordinary path (DatabaseOnDisk::renameTable) already routes through detachTable -> detachTableUnlocked, so it is covered; DatabaseReplicated::renameTable delegates to DatabaseAtomic::renameTable, so it is covered by the same lambda fix. The shared helper means no detach/rename path can bypass the cleanup. Fix is at the source (where the name is removed from tables), not a guard at the hint site.

Session id: cron:clickhouse-worker-slot-9:20260610-234100

Stateless test for issue ClickHouse#91777. It uses DETACH DATABASE + ATTACH
DATABASE to re-trigger the async-load path that populates startup_table
without a full server restart, then checks two paths that previously
leaked a no-longer-present name into a table-not-found hint:

  1. DROP TABLE: after dropping an async-loaded table, a lookup of a
     name one edit away must not be answered with a hint pointing at
     the dropped table.
  2. RENAME TABLE: after renaming an async-loaded table away, a lookup
     one edit from the old name must not suggest the old name.

Both scenarios use a near-miss name rather than the exact dropped name,
so they reproduce independently of the exact-name self-suggestion that
getHintForTable already suppresses, and so each scenario fails on master
without the fix. The hint check is database-qualified so a legitimate
cross-database hint from a concurrent parallel test is not mistaken for
the bug. Setup runs under set -euo pipefail with a cleanup trap so a
failed setup command cannot reach a false OK.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the groeneai/fix-91777-startup-table-leak branch from 6ea3728 to eaee452 Compare June 11, 2026 01:01
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate (commit eaee452)

This commit strengthens the regression test so both scenarios fail on current master and pass with the fix. Context: the symptom-layer fix #95116 ("Fix bad hints suggesting the exact same name", merged 2026-06-04) now suppresses the exact-name self-suggestion Maybe you meant <db>.t?. The previous test asserted only that exact-name case, so on current master it passed regardless of this PR's fix and Bugfix validation (functional tests) reported "Failed to reproduce the bug". This PR's fix targets a different, still-reproducing defect: a removed name (dropped or renamed) lingering in startup_table/load_table and being offered as a near-miss hint, which #95116 does not guard.

# Question Answer
a Deterministic repro? Yes. On master (debug, includes #95116): CREATE DATABASE db; CREATE TABLE db.tbl0 ... MergeTree; DETACH DATABASE db; ATTACH DATABASE db; DROP TABLE db.tbl0 SYNC; DROP TABLE db.tbl1 SYNCTable db.tbl1 does not exist. Maybe you meant db.tbl0? (suggests the dropped table). Same with RENAME TABLE db.alpha TO db.beta then DROP TABLE db.alphx SYNC → suggests the renamed-away alpha.
b Root cause explained? DatabaseOrdinary::getAllTableNames unions tables with the async-load startup_table map. detachTableUnlocked (DROP/DETACH) removed only from tables; DatabaseAtomic::renameTable's local detach lambda likewise left the old name in startup_table. TableNameHints::getAllRegisteredNames reads getAllTableNames, so a removed name remained a hint candidate. Levenshtein(tbl0,tbl1)=1 ≤ mistake_factor=2, and tbl0tbl1, so #95116's exact-name guard does not fire and the dropped name is suggested.
c Fix matches root cause? Yes. DatabaseOrdinary::detachTableUnlocked is overridden to erase the name from startup_table/load_table (shared eraseAsyncLoadState helper), and DatabaseAtomic::renameTable's detach path calls the same helper. The base detachTableUnlocked is made virtual so Ordinary/Atomic/Replicated all route through the cleanup. The removed name is no longer a hint candidate.
d Test intent preserved / new tests added? Strengthened. The test still covers both DROP and RENAME, but each scenario now uses a near-miss lookup (tbl1→dropped tbl0; alphx→renamed-away alpha) instead of an exact-name self-suggestion. The exact-name version passed on master independent of this fix (masked by #95116) and validated nothing; the near-miss version fails on master for both scenarios and validates both the DROP-path and RENAME-path fixes.
e Both directions demonstrated? Yes, through the test harness. No-fix master binary (Build ID ebc16ed5): both scenarios print BAD (test fails). With-fix binary (Build ID 312caa8e): OK\nOK, 15/15 runs, no flakiness. Also confirmed legitimate hints are preserved (typo for an existing table still suggests it) and an async-loaded existing table is still suggested (fix does not over-erase).
f Fix is general, not a narrow patch? Yes. The cleanup is at the source of the leak (the detach/rename paths that mutate tables), not a guard at the hint site. Making the base detachTableUnlocked virtual covers all DatabaseWithOwnTablesBase subclasses; the Atomic rename path is handled explicitly since it bypasses detachTableUnlocked. getAllTableNames's only other consumer (DatabaseReplicated debug ZK-diff) also benefits from the corrected name set.

Session id: cron:clickhouse-worker-slot-20:20260611-002300

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on `eaee452d47d7`. Both failures are chronic/infra, not caused by this PR (which touches `DatabaseOrdinary.cpp` / `DatabaseAtomic.cpp` async-load cleanup + the regression test):

  • `Stateless tests (arm_asan_ubsan, azure, sequential)` — `Server died` / `AddressSanitizer` (STID 3772-3b00): the Azure arm_asan_ubsan mass outage of 2026-06-10/11 (ASan secondary-allocator OOM). Not PR-related.
  • `Stress test (arm_ubsan)` — `Hung check failed, possible deadlock found`: chronic Stress arm_ubsan hung-check noise, 24 PRs / 44 master in 30d. Not specific to this PR.

The renameTable async-load-state cleanup gap flagged by the bot is addressed in the current HEAD (shared `eraseAsyncLoadState` helper called from both `detachTableUnlocked` and the `DatabaseAtomic::renameTable` detach lambda), with an Atomic-rename regression case. Ready for review.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI summary for HEAD eaee452d47d7 — all red marks are non-PR-caused (infra / chronic trunk), the PR's own checks are green:

  • Stateless tests (arm_asan_ubsan, azure, sequential) — "Server died" + AddressSanitizer: CHECK failed: sanitizer_allocator_secondary.h:199 (STID 3772-3b00). This is the known 2026-06-10/11 azure ASan internal LargeMmapAllocator OOM outage (no ClickHouse frames in the stack); it hit ~40 unrelated tests / 50+ PRs + master in the same window. Not caused by this PR (which only clears async load tasks on detach).
  • Stress test (arm_ubsan) — "Hung check failed, possible deadlock found". Chronic stress-test infra finding across 12+ unrelated PRs in 14d. Tracked separately.
  • Fast test, Style check, Bugfix validation (functional), Build, and all other jobs pass or skip.

CH Inc sync red is the internal CH mirror sync, expected on fork PRs (not a test).

@PedroTadim

Copy link
Copy Markdown
Member

@groeneai merge with master

@groeneai

Copy link
Copy Markdown
Contributor Author

Merged master into the branch in 6c9ca50. Conflict-free; the fix (shared eraseAsyncLoadState from detachTableUnlocked + the DatabaseAtomic::renameTable detach lambda) and the 04304 regression are unchanged. Rebuilt against current master and both 04304 scenarios still pass. Fresh CI triggered.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on the merge commit (6c9ca50). The two non-infra checks that show red are not caused by this PR (which touches only DatabaseAtomic/DatabaseOrdinary async-load cleanup + the 04304 test):

  • Stress test (arm_release)Last stored last_written_position in meta file ... is bigger than current last_written_pos (StorageFileLog, STID 2733-34dd). Pre-existing trunk bug in StorageFileLog, unrelated to async-load detach; surfaced under stress (Thread Fuzzer + RandomQueryKiller) on 2 unrelated PRs in 30d, 0 on master. Tracked and fixed separately in Fix FileLog crash on delete+recreate reusing the same inode #107617.
  • CH Inc sync — internal mirror check; does not run for fork PRs.

Finish Workflow is green and all real test jobs passed.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on the current HEAD (6c9ca50, the 2026-06-16 auto master-merge; the two real fix/test commits 22658b4 + eaee452 are unchanged). The two fix/test commits are clean; the only red is a pre-existing trunk failure unrelated to this PR:

The PR-own checks (Bugfix validation of 04304, Stateless, Builds) are green. Closes #91777.

Comment thread src/Databases/DatabaseOrdinary.cpp Outdated
StoragePtr DatabaseOrdinary::detachTableUnlocked(const String & table_name)
{
eraseAsyncLoadState(table_name);
return DatabaseWithOwnTablesBase::detachTableUnlocked(table_name);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@groeneai Should we eraseAsyncLoadState after DatabaseWithOwnTablesBase::detachTableUnlocked?
In case DatabaseWithOwnTablesBase::detachTableUnlocked throws an exception?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed in 36cf155. The base DatabaseWithOwnTablesBase::detachTableUnlocked throws UNKNOWN_TABLE before mutating anything when the table is absent, so erasing first would clear the async-load state for a table that was never detached. Now the base detach runs first and eraseAsyncLoadState only runs on success (both map erases are noexcept). The DatabaseAtomic::renameTable detach lambda was already correct: it erases after tables.erase/table_name_to_path.erase. 04304 still passes 20/20.

DatabaseOrdinary::detachTableUnlocked erased startup_table/load_table
before calling DatabaseWithOwnTablesBase::detachTableUnlocked. The base
throws UNKNOWN_TABLE before mutating anything when the table is absent, so
on that throw the async-load state was cleared for a table that was never
detached. Call the base first and erase only on success.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

CI ledger — covered SHA 36cf1552a00b

CI fully finished on this HEAD (Finish Workflow + Mergeable Check complete, no checks queued/in_progress). This HEAD is the eraseAsyncLoadState exception-safety reorder (base detach first, erase async-load state only on success) addressing @tuanpach's review.

The PR touches only src/Databases/{DatabaseAtomic.cpp,DatabaseOrdinary.cpp,DatabaseOrdinary.h,DatabasesCommon.h} (async-load-state cleanup on detach/rename) plus the 04304 regression test. One red check, not PR-caused:

  • Stress test (arm_debug) / "Hung check failed, possible deadlock found" — chronic Stress Thread-Fuzzer hung-check, not specific to this PR. 30d CIDB: 278 master hits + 20+ unrelated PRs (e.g. 101039, 107567, 104396, 80416, 91836, 105321, 107899, 99495). The hung stacktrace is executeASTFuzzerQueries -> PipelineExecutor::executeImpl waiting under ThreadFuzzer injection in jemalloc; zero async-load / detach / rename frames. Tracked separately (chronic Hung-check deadlock tracker).
  • PR (aggregator) — rollup of the above; no distinct failure.

No PR-caused failures. Ready for review.

@tuanpach

Copy link
Copy Markdown
Member

@groeneai

Check CI failures. Any PRs fixing them?

@groeneai

Copy link
Copy Markdown
Contributor Author

@tuanpach The only red is Stress test (arm_debug) -> "Hung check failed, possible deadlock found" (report). The PR check is just its rollup; every other job is green.

It is not PR-caused and not actually a deadlock. The hung stack is the server-side AST fuzzer running a cancelled-but-still-executing fuzzed query past the 90s hung-check window:

ThreadPoolImpl::wait
PipelineExecutor::executeImpl
CompletedPipelineExecutor::execute
DB::executeASTFuzzerQueries

No frame touches this PR's code (the DatabaseOrdinary/DatabaseAtomic async-load cleanup). It is a chronic trunk failure across every stress sanitizer lane: on arm_debug alone, 49 distinct PRs hit it in the last 30d vs only 2 master.

PRs fixing it:

The PR is otherwise green.

@clickhouse-gh

clickhouse-gh Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 14/14 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 29, 2026
Merged via the queue into ClickHouse:master with commit 042af6d Jun 29, 2026
174 checks passed
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Confusing drop message after server restart

5 participants