iframe-proxy

groeneai · 2026-05-13T13:25:36Z

StorageTableProxy caches the column list from the CREATE TABLE query at
construction and updates its own in-memory metadata only lazily in
StorageProxy::alter after nested->alter returns. When the underlying
alter blocks waiting for a long-running mutation, the nested storage's
setProperties has already published the new schema, but the proxy still
serves the pre-alter schema to any caller that looks up its metadata.

A concurrent INSERT then resolves column names against the proxy's stale
view (succeeds), and proxy.write builds a sink from the nested's current
metadata, so the pipeline ends up with a source on the old column name and
a sink on the new column name. The chain crashes in Chain::addSink with

Logical error: 'Block structure mismatch in function connect between
                RemovingSparseTransform and MergeTreeSink stream:
                different names of columns: c0 ... c1'

Reproducer (from @PedroTadim on the issue):

CREATE DATABASE d0 ENGINE = Atomic SETTINGS lazy_load_tables = 1;
CREATE TABLE d0.t0 (c0 Int) ENGINE = MergeTree() ORDER BY tuple();
INSERT INTO TABLE d0.t0 (c0) VALUES (1);
DETACH DATABASE d0 SYNC;
ATTACH DATABASE d0;
SYSTEM STOP MERGES d0.t0;
ALTER TABLE d0.t0 RENAME COLUMN c0 TO c1; -- blocks while merges are stopped
INSERT INTO TABLE d0.t0 (c0) VALUES (2);  -- crashes the server

The DETACH/ATTACH cycle is what arms the bug: it forces the table to be
served via a StorageTableProxy (because lazy_load_tables=1), and
SYSTEM STOP MERGES keeps the rename mutation pending so the proxy's
cached metadata never gets synced to the post-rename schema.

The fix overrides getInMemoryMetadataPtr in StorageTableProxy to
forward to the nested storage once it has been materialized. After that
point all metadata observers see the same schema as the nested storage,
so the source and sink of the INSERT pipeline are built against the
same metadata snapshot. The bad INSERT now fails cleanly with
NO_SUCH_COLUMN_IN_TABLE and the server stays alive.

Verified locally:

Without the fix, the reproducer above crashes the server with the
Block structure mismatch LOGICAL_ERROR from Chain::addSink.
With the fix, the same reproducer returns
Code: 16. DB::Exception: No such column c0 in table d0.t0 (NO_SUCH_COLUMN_IN_TABLE) and the server stays up.
New stateless test 04239_alter_rename_column_lazy_load_tables_104819
passes 20/20 runs locally.
Existing related tests (03823_lazy_load_tables,
03276_merge_tree_index_lazy_load, 03128_merge_tree_index_lazy_load,
01213_alter_rename_column) still pass.

Closes #104819

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a server abort (Logical error: 'Block structure mismatch') that
could occur during a concurrent INSERT while an ALTER RENAME COLUMN
is in flight in an Atomic database with lazy_load_tables = 1 after
DETACH DATABASE / ATTACH DATABASE.

Documentation entry for user-facing changes

Documentation is unchanged

Session: cron:clickhouse-ci-task-worker:20260513-121500

Version info

Merged into: 26.7.1.176

…AME (ClickHouse#104819) `StorageTableProxy` caches the column list from the `CREATE TABLE` query at construction and updates its in-memory metadata lazily in `StorageProxy::alter` *after* `nested->alter` returns. When the underlying `alter` blocks waiting for a long-running mutation (for example, a `RENAME COLUMN` while merges are stopped after `DETACH DATABASE` / `ATTACH DATABASE`), the nested storage's `setProperties` has already published the new schema, but the proxy is still serving the pre-alter schema. A concurrent `INSERT` resolves column names against the proxy's stale view, then `proxy.write` builds a sink from the nested's current metadata, and the pipeline crashes in `Chain::addSink` with Logical error: 'Block structure mismatch in function connect between RemovingSparseTransform and MergeTreeSink stream: different names of columns: c0 ... c1' Override `getInMemoryMetadataPtr` in `StorageTableProxy` to forward to the nested storage once it has been materialized, so every metadata observer sees the same schema as the nested storage and the race window disappears. Adds a stateless regression test under `tests/queries/0_stateless/`. Issue: ClickHouse#104819

groeneai · 2026-05-13T13:26:21Z

cc @alesapin @CurtizJ @tavplubix — could you review this?

Tiny fix in StorageTableProxy::getInMemoryMetadataPtr (1 override, 18 lines) for a Logical error: Block structure mismatch crash that PedroTadim found via BuzzHouse on master and reduced to a clean SQL repro (#104819). The proxy used to serve its CREATE-time cached column list while the nested storage had already moved to the post-ALTER RENAME schema; forwarding the metadata query to the nested once it has been materialized closes the race.

@alesapin — lazy_load_tables setting handling
@CurtizJ — ALTER RENAME COLUMN / mutations
@tavplubix — DatabaseAtomic / DETACH/ATTACH semantics

Repro and verification details are in the PR body.

Pre-PR validation gate (TASK.md Phase 4, step 9)

(a) Deterministic repro? Yes — PedroTadim's exact SQL crashes the server with Block structure mismatch on every run against unfixed master.
(b) Root cause explained? Yes — StorageProxy::alter calls nested->alter(...) (which publishes the new schema via setProperties synchronously) and then calls IStorage::setInMemoryMetadata(...) on the proxy itself to sync. With SYSTEM STOP MERGES (or any long-running mutation), nested->alter blocks waiting for the rename mutation to complete, so the sync step is delayed. During that window, the proxy reports the pre-alter schema (c0) but the nested reports the post-alter schema (c1). An INSERT (c0) resolves column names against the proxy's c0, but proxy.write delegates to nested.write which builds the sink from the nested's c1. The chain pipes a source with c0 into a sink that expects c1 and Chain::addSink aborts with the logical error.
(c) Fix matches root cause? Yes — StorageTableProxy::getInMemoryMetadataPtr now forwards to nested->getInMemoryMetadataPtr whenever nested has been materialized. This makes the proxy's metadata observers always agree with the nested's, eliminating the inconsistent view that the source and sink were observing.
(d) Test intent preserved / new test added? New stateless test 04239_alter_rename_column_lazy_load_tables_104819.sh runs PedroTadim's exact sequence (CREATE → DETACH/ATTACH → STOP MERGES → background ALTER → INSERT) and asserts the server stays alive.
(e) Both directions demonstrated? Yes —
- Without the fix (current master binary): INSERT INTO d0.t0 (c0) VALUES (2) triggers Logical error: 'Block structure mismatch in function connect between RemovingSparseTransform and MergeTreeSink stream: different names of columns: c0 ... c1', server crashes.
- With the fix: same INSERT returns Code: 16. NO_SUCH_COLUMN_IN_TABLE, server stays alive. New regression test passes 20/20 runs locally.

Session: cron:clickhouse-ci-task-worker:20260513-121500

clickhouse-gh · 2026-05-13T14:19:16Z

Workflow [PR], commit [2be399e]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR fixes a real lazy-table metadata consistency bug by making StorageTableProxy forward getInMemoryMetadataPtr to the materialized nested storage. The changed invariant holds through the relevant ALTER RENAME COLUMN, DESCRIBE, INSERT, and StorageSnapshot paths, and the added regression test targets the stale-proxy-metadata window without relying on a server abort.

Final Verdict

Status: ✅ Approve

…04819) The original `04239_alter_rename_column_lazy_load_tables_104819` reproduced the bug by issuing an `INSERT` while the proxy's metadata was stale, which killed the server with `Block structure mismatch`. `clickhouse-test` records a server-death as `server-died` (mapped to `ERROR`, not `FAIL`) and the bugfix-validation framework only inverts `FAIL <-> OK`, so the job reported "Failed to reproduce the bug" even though the bug fired. Re-target the same race window with `DESCRIBE TABLE` instead of `INSERT`: - `DESCRIBE TABLE` calls `storage->getInMemoryMetadataPtr`, which without the fix returns the proxy's stale cached metadata (`c0`) and with the fix forwards to the nested storage (`c1`). - The reference pins the expected column name to `c1`, so the buggy master-HEAD binary produces a plain output diff (`FAIL`) that the bugfix-validation framework can invert to `OK`/"bug reproduced". - `SHOW CREATE TABLE` is polled first to confirm the race window has opened (`setProperties` has updated the nested storage and rewritten the on-disk `CREATE TABLE`). Verified locally on the same Debug build: - Without the override on `StorageTableProxy::getInMemoryMetadataPtr`, 20/20 runs print `c0` (test FAILs against the new reference). - With the override in place, 30/30 runs print `c1` (test PASSes). The server is not killed by the new test, so bugfix validation sees a real `FAIL` and the regular stateless tests see a clean `OK`. See CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104852&sha=28a83d8fd16d042392976376bf7edb6e0ac7fded&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29 Session: cron:clickhouse-ci-task-worker:20260513-144500

groeneai · 2026-05-13T15:09:14Z

@alesapin @CurtizJ @tavplubix — fixup pushed (1480c38cc0f) to address the Bugfix validation (functional tests) FAIL ("Failed to reproduce the bug") on the prior commit.

Root cause of the Bugfix-validation FAIL

The old test reproduced the bug by issuing an INSERT that crashed the
server with Block structure mismatch. clickhouse-test records a
server-death as server-died and FTResultsProcessor
(functional_tests_results.py:191-205)
maps the single failed test to Result.Status.ERROR. The bugfix-validation
framework
(functional_tests.py:751-762)
only inverts FAIL <-> OK, so an ERROR result reads as "Failed to reproduce
the bug" even though the server-death log line in the run confirmed the
crash fired on master HEAD.

What the fixup changes

04239_alter_rename_column_lazy_load_tables_104819 now probes the same
race window with DESCRIBE TABLE instead of INSERT:

InterpreterDescribeQuery calls storage->getInMemoryMetadataPtr,
which hits StorageTableProxy.
Without the override on getInMemoryMetadataPtr (master HEAD), the
proxy returns its stale cached metadata → c0.
With the override (this PR), the proxy forwards to the nested storage → c1.
The reference pins the expected column name to c1, so the buggy
build produces a plain output diff (TestStatus.FAIL) that the
bugfix-validation framework can invert to OK and report "bug
reproduced".
The polling for SHOW CREATE TABLE showing c1 is the race-window
signal: the on-disk CREATE TABLE is rewritten by setProperties
synchronously, before the alter blocks on the rename mutation.

The fix itself (src/Storages/StorageTableProxy.h) is unchanged on this
push. Only the test files moved.

Pre-PR validation gate (re-run on the new test)

a) Deterministic repro? Yes — SYSTEM STOP MERGES parks the rename
mutation indefinitely, so the race window stays open until the test
restarts merges. The SHOW CREATE TABLE poll waits for
setProperties to publish the new AST before sampling DESCRIBE.

b) Root cause explained? StorageProxy::alter calls nested->alter
first and then IStorage::setInMemoryMetadata(nested->getInMemoryMetadataPtr(...))
to sync the proxy's cache. When nested->alter blocks on the rename
mutation while setProperties has already published the new schema
in the nested storage, the proxy's cached metadata stays at the
pre-rename schema. The override forwards getInMemoryMetadataPtr to
the nested storage once it has been materialized, so all metadata
observers see the same snapshot.

c) Fix matches root cause? Yes — the override is the minimal change
that keeps proxy.getInMemoryMetadataPtr in lockstep with
nested.getInMemoryMetadataPtr for the lifetime of any concurrent
reader/writer.

d) Test intent preserved? The new test exercises the same bug
surface (StorageTableProxy::getInMemoryMetadataPtr being stale
during a blocked ALTER RENAME COLUMN). It just observes the
mismatch via a non-crashing query so the framework can score it.

e) Both directions demonstrated? Yes, on this exact Debug build:

Without the override: 20/20 runs print c0 (test FAIL).
With the override: 30/30 runs print c1 (test OK).

CI report for the previous FAIL:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104852&sha=28a83d8fd16d042392976376bf7edb6e0ac7fded&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29

Session: cron:clickhouse-ci-task-worker:20260513-144500

PedroTadim · 2026-05-14T16:19:54Z

@groeneai can you check the failed tests? I think they may not be related

groeneai · 2026-05-14T17:47:56Z

@PedroTadim Confirmed — all 7 CI failures on commit 1480c38c are unrelated chronic flakes, none touch the code paths this PR modifies (StorageTableProxy::getInMemoryMetadataPtr override + 04239_alter_rename_column_lazy_load_tables_104819.sh test).

Per-failure attribution (CIDB, 30-day window):

Test	Job	Unrelated PRs (30d)	Verdict
`TLPDistinct`	SQLancer	178	Chronic SQLancer TLP family
`TLPGroupBy`	SQLancer	178	Chronic SQLancer TLP family
`TLPAggregate`	SQLancer	177	Chronic SQLancer TLP family
`TLPHaving`	SQLancer	156	Chronic SQLancer TLP family
`FunctionsStress.stress` / `AllTests`	Unit tests (asan_ubsan, function_prop_fuzzer)	132	Chronic `broken_determinism` family (PR #104633 in flight)
`03658_joined_block_split_single_row_bytes`	Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel)	9 + 1 master hit on 2026-05-13	Chronic flake on that lane
`01039_mergetree_exec_time`	Stateless tests (amd_msan, WasmEdge, parallel, 1/2)	3	Transient — `--diagnose-random-settings` passed 9/9 reruns on this PR

Relevant PR-specific checks already pass:

AI Code Review: ✅ Approve ("did not find unresolved correctness, safety, or contract issues")
Bugfix validation (functional tests): ✅ success (after the .sh fixup that reshapes the test to use DESCRIBE instead of INSERT, since chassert-abort in clickhouse-client was being scored as ERROR rather than FAIL and so wasn't getting flipped by the bugfix-validation framework)

Recommend just re-triggering CI if you want a clean run for the merge — or merge as-is, since the PR-relevant gates are green and the test exercises the exact race PedroTadim's reproducer hits (verified locally: WITHOUT fix prints stale c0, WITH fix prints fresh c1).

groeneai · 2026-06-04T21:50:43Z

CI ledger — covered SHA: `c61788239bab6193ecd9a252190506af51ac252c`

CI fully finished (Mergeable Check completed; 0 pending checks).

Check	Test	Disposition
`Unit tests (asan_ubsan)`	(check-level)	infra/chronic UBSan UntrackedMemoryCounter — task 2026-06-03-ci-p1-chronic-ubsan-in-untrackedmemoryc
`Stateless tests (amd_tsan, parallel, 1/2)`	(check-level, no test_name in CIDB)	likely follows the obfuscator chronic flaky pattern; no specific PR-caused failures captured.

No PR-caused failures captured in CIDB for this SHA. Awaiting human review (@ PedroTadim).

Session: cron:our-pr-ci-monitor:20260604-213000

PedroTadim · 2026-06-08T13:38:13Z

Hello! This PR seems in a good shape to merge?

groeneai · 2026-06-08T18:34:35Z

Yes, from our side. CI ledger on the current head (c617882) is posted above:

AI Review: Approve
Bugfix validation: success
Mergeable Check / Finish Workflow: success
All other checks green

The 2 red checks are both chronic on those lanes, unrelated to this diff:

Unit tests (asan_ubsan) -> UBSan chronic UntrackedMemoryCounter
Stateless tests (amd_tsan, parallel, 1/2) -> obfuscator chronic flaky (00096_obfuscator_save_load + 00175_obfuscator_schema_inference, ~120 hits each across many unrelated PRs in 14d)

We pinged the component owners (alesapin, CurtizJ, tavplubix) on the original review request but haven't seen substantive feedback yet. If you can give it the nod or nudge a maintainer, we're standing by for any last comments.

Session: cron:clickhouse-ci-task-worker:20260608-183200

tuanpach · 2026-06-22T01:59:43Z

@groeneai

Check the failure in CI.
Any PRs fixing the failures?

groeneai · 2026-06-22T02:28:30Z

Thanks for the approval @tuanpach. Both red checks are the same already-reverted master bug, not caused by this PR.

This commit (c6178823) was last synced from master in the narrow window when UntrackedMemory async metric (#104690, merged 2026-06-02 21:16Z) was on master but its revert #106365 (merged 2026-06-03 06:28Z) was not yet. The base contains #104690 and is missing the revert (git compare: behind_by 0 from #104690's merge, behind_by 6 from #106365). This commit's CI predates the revert.

Red check	Failure	Cause
Unit tests (asan_ubsan)	UBSan null-ptr deref in `DB::UntrackedMemoryRegistry::remove` at `src/Common/UntrackedMemoryRegistry.cpp:36` (from `~UntrackedMemoryCounter`)	file added by #104690 (report)
Stateless tests (amd_tsan, parallel, 1/2) / `00096_obfuscator_save_load`	`clickhouse-obfuscator --save` SIGABRT, 9/9 reproducible, "also fails without randomized settings"	obfuscator crash from #104690 (report)

Fixing PR: #106365 (the revert, merged 2026-06-03 06:28Z). Master is clean since then: on amd_tsan 00096_obfuscator_save_load and 00175_obfuscator_schema_inference are 560/560 OK, and Unit tests (asan_ubsan) is green.

This PR touches only src/Storages/StorageTableProxy.h (one override) and the new 04239 regression test, none of which is in the UntrackedMemory or obfuscator path. A rebase onto current master clears both. Happy to rebase for a green board before merge if you prefer.

tuanpach · 2026-06-22T02:31:10Z

@groeneai Please merge master to this branch.

…-lazy-load-tables-proxy-metadata-race Adapt StorageTableProxy::getInMemoryMetadataPtr override return type from StorageMetadataPtr to StorageMetadataHandle to match the IStorage base virtual signature changed on master (lifetime-safe metadata handle). No behavioral change; the override still forwards to the nested storage.

groeneai · 2026-06-22T03:39:38Z

@tuanpach Merged current master into the branch (now at 2be399e391).

One adaptation was needed: master changed the IStorage::getInMemoryMetadataPtr base return type to StorageMetadataHandle (lifetime-safe wrapper), so I updated this PR's override to match. No behavioral change, it still forwards to the nested storage. Built debug locally and re-ran the regression test 04239 (15/15 OK).

The 2 prior red checks were both the already-reverted #104690 (stale base): Unit tests (asan_ubsan) UBSan in UntrackedMemoryRegistry and 00096_obfuscator_save_load SIGABRT on amd_tsan. They were fixed by revert #106365, which the merge now includes, so the fresh CI run should be clean.

clickhouse-gh · 2026-06-22T07:05:46Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 6/6 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

nikitamikhaylov added the can be tested Allows running workflows for external contributors label May 13, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 13, 2026

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

Merge commit '3fa1ed2d9d9cc4f472cd43e2e4eaabac12ec36db' into HEAD

c617882

tuanpach approved these changes Jun 22, 2026

View reviewed changes

tuanpach self-assigned this Jun 22, 2026

alexey-milovidov added this pull request to the merge queue Jun 27, 2026

Merged via the queue into ClickHouse:master with commit 89b0f90 Jun 27, 2026
167 checks passed

robot-clickhouse-ci-2 mentioned this pull request Jun 27, 2026

Logical error: Block structure mismatch in A stream: different names of columns: (STID: 0993-2385) #104819

Closed

robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 27, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented May 13, 2026 • edited by robot-clickhouse-ci-2 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Version info

Uh oh!

groeneai commented May 13, 2026

Pre-PR validation gate (TASK.md Phase 4, step 9)

Uh oh!

clickhouse-gh Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

groeneai commented May 13, 2026

Root cause of the Bugfix-validation FAIL

What the fixup changes

Pre-PR validation gate (re-run on the new test)

Uh oh!

PedroTadim commented May 14, 2026

Uh oh!

groeneai commented May 14, 2026

Uh oh!

groeneai commented Jun 4, 2026

CI ledger — covered SHA: c61788239bab6193ecd9a252190506af51ac252c

Uh oh!

PedroTadim commented Jun 8, 2026

Uh oh!

groeneai commented Jun 8, 2026

Uh oh!

tuanpach commented Jun 22, 2026

Uh oh!

groeneai commented Jun 22, 2026

Uh oh!

tuanpach commented Jun 22, 2026

Uh oh!

groeneai commented Jun 22, 2026

Uh oh!

clickhouse-gh Bot commented Jun 22, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

groeneai commented May 13, 2026 •

edited by robot-clickhouse-ci-2

Loading

clickhouse-gh Bot commented May 13, 2026 •

edited

Loading

CI ledger — covered SHA: `c61788239bab6193ecd9a252190506af51ac252c`