Fix Block structure mismatch on lazy_load_tables=1 during ALTER RENAME COLUMN (#104819) #104852
…AME (ClickHouse#104819)

`StorageTableProxy` caches the column list from the `CREATE TABLE` query at construction and updates its in-memory metadata lazily in `StorageProxy::alter` *after* `nested->alter` returns. When the underlying `alter` blocks waiting for a long-running mutation (for example, a `RENAME COLUMN` while merges are stopped after `DETACH DATABASE` / `ATTACH DATABASE`), the nested storage's `setProperties` has already published the new schema, but the proxy is still serving the pre-alter schema.

A concurrent `INSERT` resolves column names against the proxy's stale view, then `proxy.write` builds a sink from the nested's current metadata, and the pipeline crashes in `Chain::addSink` with

Logical error: 'Block structure mismatch in function connect between RemovingSparseTransform and MergeTreeSink stream: different names of columns: c0 ... c1'

Override `getInMemoryMetadataPtr` in `StorageTableProxy` to forward to the nested storage once it has been materialized, so every metadata observer sees the same schema as the nested storage and the race window disappears.

Adds a stateless regression test under `tests/queries/0_stateless/`.

Issue: ClickHouse#104819
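The stale-view window and the forwarding override can be illustrated with a small toy model. This is a sketch in Python, not ClickHouse code: `Metadata`, `NestedStorage`, `StaleProxy`, `ForwardingProxy` and `get_in_memory_metadata` are hypothetical stand-ins for the C++ classes and method named in the commit message, and the blocked `alter` is modeled simply by never running the proxy's post-alter sync.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Metadata:
    """Immutable schema snapshot (hypothetical stand-in for in-memory metadata)."""
    columns: Tuple[str, ...]


class NestedStorage:
    """Stand-in for the real table behind the proxy."""

    def __init__(self, columns):
        self._metadata = Metadata(tuple(columns))

    def get_in_memory_metadata(self) -> Metadata:
        return self._metadata

    def publish_rename(self, old: str, new: str) -> None:
        # Models the point where the nested storage has published the new
        # schema while the alter call itself has not returned yet.
        cols = tuple(new if c == old else c for c in self._metadata.columns)
        self._metadata = Metadata(cols)


class StaleProxy:
    """Pre-fix behaviour: always serves the metadata cached at construction."""

    def __init__(self, columns):
        self._cached = Metadata(tuple(columns))
        self._nested: Optional[NestedStorage] = None

    def materialize(self) -> NestedStorage:
        if self._nested is None:
            self._nested = NestedStorage(self._cached.columns)
        return self._nested

    def get_in_memory_metadata(self) -> Metadata:
        return self._cached


class ForwardingProxy(StaleProxy):
    """Post-fix behaviour: forward to the nested storage once it exists."""

    def get_in_memory_metadata(self) -> Metadata:
        if self._nested is not None:
            return self._nested.get_in_memory_metadata()
        return self._cached


def columns_seen_during_blocked_alter(proxy_cls):
    """Columns a metadata observer sees while the alter is still blocked."""
    proxy = proxy_cls(["c0"])
    nested = proxy.materialize()
    nested.publish_rename("c0", "c1")  # schema published, alter not returned
    return proxy.get_in_memory_metadata().columns
```

In this model `columns_seen_during_blocked_alter(StaleProxy)` still reports the old column while `columns_seen_during_blocked_alter(ForwardingProxy)` reports the renamed one, which is the property the override is meant to provide. Before the nested storage is materialized, the forwarding variant falls back to the cached metadata.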
|
Workflow [PR], commit [2be399e] Summary: ✅
AI Review

Summary
This PR fixes a real lazy-table metadata consistency bug by making

Final Verdict
Status: ✅ Approve
…04819)

The original `04239_alter_rename_column_lazy_load_tables_104819` reproduced the bug by issuing an `INSERT` while the proxy's metadata was stale, which killed the server with `Block structure mismatch`. `clickhouse-test` records a server-death as `server-died` (mapped to `ERROR`, not `FAIL`) and the bugfix-validation framework only inverts `FAIL <-> OK`, so the job reported "Failed to reproduce the bug" even though the bug fired.

Re-target the same race window with `DESCRIBE TABLE` instead of `INSERT`:

- `DESCRIBE TABLE` calls `storage->getInMemoryMetadataPtr`, which without the fix returns the proxy's stale cached metadata (`c0`) and with the fix forwards to the nested storage (`c1`).
- The reference pins the expected column name to `c1`, so the buggy master-HEAD binary produces a plain output diff (`FAIL`) that the bugfix-validation framework can invert to `OK`/"bug reproduced".
- `SHOW CREATE TABLE` is polled first to confirm the race window has opened (`setProperties` has updated the nested storage and rewritten the on-disk `CREATE TABLE`).

Verified locally on the same Debug build:

- Without the override on `StorageTableProxy::getInMemoryMetadataPtr`, 20/20 runs print `c0` (test FAILs against the new reference).
- With the override in place, 30/30 runs print `c1` (test PASSes).

The server is not killed by the new test, so bugfix validation sees a real `FAIL` and the regular stateless tests see a clean `OK`.

See CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104852&sha=28a83d8fd16d042392976376bf7edb6e0ac7fded&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29

Session: cron:clickhouse-ci-task-worker:20260513-144500
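The status handling that made the original test unusable for bugfix validation can be modeled in a few lines. This is a toy sketch of the behaviour described in the commit message, not the framework's code: the function names and the outcome labels (`pass`, `output_diff`, `server_died`) are hypothetical, and only the `OK`/`FAIL`/`ERROR` statuses and the `FAIL <-> OK` inversion come from the description.

```python
def raw_status(outcome: str) -> str:
    """Map what happened during a test run to a reported status."""
    return {
        "pass": "OK",
        "output_diff": "FAIL",
        "server_died": "ERROR",  # a server death is not reported as FAIL
    }[outcome]


def inverted_status(status: str) -> str:
    """Bugfix validation swaps FAIL and OK; other statuses pass through."""
    return {"FAIL": "OK", "OK": "FAIL"}.get(status, status)


def validation_verdict(outcome_on_unfixed_binary: str) -> str:
    """Verdict when the new test runs against a binary without the fix."""
    status = inverted_status(raw_status(outcome_on_unfixed_binary))
    return "bug reproduced" if status == "OK" else "failed to reproduce the bug"
```

Under these assumptions an output diff on the unfixed binary counts as a reproduction, while a server death stays `ERROR` after inversion and is reported as a failure to reproduce, which is why the test was retargeted from `INSERT` to `DESCRIBE TABLE`.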
|
@alesapin @CurtizJ @tavplubix, fixup pushed (

Root cause of the Bugfix-validation FAIL
The old test reproduced the bug by issuing an

What the fixup changes

The fix itself (

Pre-PR validation gate (re-run on the new test)
a) Deterministic repro? Yes
b) Root cause explained?
c) Fix matches root cause? Yes, the override is the minimal change
d) Test intent preserved? The new test exercises the same bug
e) Both directions demonstrated? Yes, on this exact Debug build:

CI report for the previous FAIL:

Session: cron:clickhouse-ci-task-worker:20260513-144500
|
@groeneai can you check the failed tests? I think they may not be related |
|
@PedroTadim Confirmed: all 7 CI failures on commit

Per-failure attribution (CIDB, 30-day window):

Relevant PR-specific checks already pass:

Recommend just re-triggering CI if you want a clean run for the merge, or merge as-is, since the PR-relevant gates are green and the test exercises the exact race PedroTadim's reproducer hits (verified locally: WITHOUT fix prints stale
CI ledger — covered SHA:
|
| Check | Test | Disposition |
|---|---|---|
| Unit tests (asan_ubsan) | (check-level) | infra/chronic UBSan UntrackedMemoryCounter (task 2026-06-03-ci-p1-chronic-ubsan-in-untrackedmemoryc) |
| Stateless tests (amd_tsan, parallel, 1/2) | (check-level, no test_name in CIDB) | likely follows the obfuscator chronic flaky pattern; no specific PR-caused failures captured. |
No PR-caused failures captured in CIDB for this SHA. Awaiting human review (@PedroTadim).
Session: cron:our-pr-ci-monitor:20260604-213000
|
Hello! This PR seems in a good shape to merge? |
|
Yes, from our side. CI ledger on the current head (c617882) is posted above:
The 2 red checks are both chronic on those lanes, unrelated to this diff:
We pinged the component owners (alesapin, CurtizJ, tavplubix) on the original review request but haven't seen substantive feedback yet. If you can give it the nod or nudge a maintainer, we're standing by for any last comments.

Session: cron:clickhouse-ci-task-worker:20260608-183200
|
Check the failure in CI. |
|
Thanks for the approval @tuanpach. Both red checks are the same already-reverted master bug, not caused by this PR. This commit (
Fixing PR: #106365 (the revert, merged 2026-06-03 06:28Z). Master is clean since then: on amd_tsan

This PR touches only
|
@groeneai Please merge master to this branch. |
…-lazy-load-tables-proxy-metadata-race

Adapt `StorageTableProxy::getInMemoryMetadataPtr` override return type from `StorageMetadataPtr` to `StorageMetadataHandle` to match the `IStorage` base virtual signature changed on master (lifetime-safe metadata handle). No behavioral change; the override still forwards to the nested storage.
|
@tuanpach Merged current master into the branch (now at

One adaptation was needed: master changed the

The 2 prior red checks were both the already-reverted #104690 (stale base):
LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 6/6 (100.00%) | Lost baseline coverage: none

`StorageTableProxy` caches the column list from the `CREATE TABLE` query at construction and updates its own in-memory metadata only lazily in `StorageProxy::alter` after `nested->alter` returns. When the underlying `alter` blocks waiting for a long-running mutation, the nested storage's `setProperties` has already published the new schema, but the proxy still serves the pre-alter schema to any caller that looks up its metadata.
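
A concurrent `INSERT` then resolves column names against the proxy's stale view (succeeds), and `proxy.write` builds a sink from the nested's current metadata, so the pipeline ends up with a source on the old column name and a sink on the new column name. The chain crashes in `Chain::addSink` with

Logical error: 'Block structure mismatch in function connect between RemovingSparseTransform and MergeTreeSink stream: different names of columns: c0 ... c1'

Reproducer (from @PedroTadim on the issue):
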
The DETACH/ATTACH cycle is what arms the bug: it forces the table to be served via a `StorageTableProxy` (because `lazy_load_tables=1`), and `SYSTEM STOP MERGES` keeps the rename mutation pending so the proxy's cached metadata never gets synced to the post-rename schema.
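
The two steps of the `INSERT` described above (name resolution, then sink construction) can be sketched as a toy function that takes the schema each step reads. This is an illustration only, not the ClickHouse pipeline: `simulate_insert` and its arguments are hypothetical, the table has a single column, and the returned strings are just labels for the two outcomes named in this description.

```python
def simulate_insert(resolver_columns, sink_columns, inserted_column):
    """Outcome of a toy single-column INSERT.

    resolver_columns: schema used to resolve the column named in the INSERT
    sink_columns:     schema used to build the sink
    """
    # Step 1: the column named in the INSERT must exist in the resolver's view.
    if inserted_column not in resolver_columns:
        return "NO_SUCH_COLUMN_IN_TABLE"

    # Step 2: the source header follows the resolved name, the sink header
    # follows whatever schema the sink was built from.
    source_header = (inserted_column,)
    sink_header = tuple(sink_columns)
    if source_header != sink_header:
        return "BLOCK_STRUCTURE_MISMATCH"
    return "OK"


STALE_VIEW = ("c0",)    # schema cached before the rename
CURRENT_VIEW = ("c1",)  # schema published by the nested storage
```

With the steps reading different snapshots, `simulate_insert(STALE_VIEW, CURRENT_VIEW, "c0")` ends in the mismatch. When both steps read the same snapshot, `simulate_insert(CURRENT_VIEW, CURRENT_VIEW, "c0")` is rejected at name resolution instead, and an insert into `c1` goes through.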

The fix overrides `getInMemoryMetadataPtr` in `StorageTableProxy` to forward to the nested storage once it has been materialized. After that point all metadata observers see the same schema as the nested storage, so the source and sink of the `INSERT` pipeline are built against the same metadata snapshot. The bad `INSERT` now fails cleanly with `NO_SUCH_COLUMN_IN_TABLE` and the server stays alive.

Verified locally:

- Without the fix: `Block structure mismatch` `LOGICAL_ERROR` from `Chain::addSink`.
- With the fix: `Code: 16. DB::Exception: No such column c0 in table d0.t0 (NO_SUCH_COLUMN_IN_TABLE)` and the server stays up.
- `04239_alter_rename_column_lazy_load_tables_104819` passes 20/20 runs locally.
- Existing tests (`03823_lazy_load_tables`, `03276_merge_tree_index_lazy_load`, `03128_merge_tree_index_lazy_load`, `01213_alter_rename_column`) still pass.

Closes #104819

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a server abort (`Logical error: 'Block structure mismatch'`) that could occur during a concurrent `INSERT` while an `ALTER RENAME COLUMN` is in flight in an `Atomic` database with `lazy_load_tables = 1` after `DETACH DATABASE` / `ATTACH DATABASE`.

Documentation entry for user-facing changes
Session: cron:clickhouse-ci-task-worker:20260513-121500
Version info
26.7.1.176