MaterializedPostgreSQL: don't break all replication on one table's structure mismatch by alexey-milovidov · Pull Request #107427 · ClickHouse/ClickHouse · GitHub
Skip to content

MaterializedPostgreSQL: don't break all replication on one table's structure mismatch#107427

Merged
alexey-milovidov merged 15 commits into
masterfrom
fix-matpg-column-mismatch-restart
Jul 1, 2026
Merged

MaterializedPostgreSQL: don't break all replication on one table's structure mismatch#107427
alexey-milovidov merged 15 commits into
masterfrom
fix-matpg-column-mismatch-restart

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 13, 2026

Copy link
Copy Markdown
Member

When a column is added to or dropped from a replicated PostgreSQL table while the change is not seen through the replication stream — e.g. the column changes and then the ClickHouse (or PostgreSQL) server restarts — MaterializedPostgreSQL stopped replicating all tables in the database, with:

LOGICAL_ERROR	Columns number mismatch. Attributes: 17, buffer: 20

Root cause. On startup the consumer builds a StorageData for every table; its constructor compares the freshly fetched PostgreSQL column count with the existing nested ClickHouse table and throws Columns number mismatch on a mismatch. That throw happened in the loop that constructs the consumer for the whole database, so one out-of-sync table aborted startup of the entire database, and it could not be recovered without dropping and recreating it.

Fix. Catch the error per table while constructing the consumer. The affected table is skipped (with a clear log message asking the user to DETACH/ATTACH it to reload its structure), while the other tables keep replicating. This mirrors how structure changes that arrive through the replication stream are already handled (the table is marked as skipped, not fatal).

The added integration test changes the structure of one table in PostgreSQL, restarts ClickHouse, and verifies that (1) the other tables keep replicating and (2) the affected table is recoverable with DETACH/ATTACH.

Closes: #66273
Related: #49045

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix MaterializedPostgreSQL stopping replication of an entire database (with LOGICAL_ERROR: Columns number mismatch) when a single replicated table's structure changed while the server was down. Now only the affected table is skipped and can be recovered with DETACH/ATTACH.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

🤖 Generated with Claude Code

Version info

  • Merged into: 26.7.1.382

…e's structure mismatch

When the structure of a replicated PostgreSQL table changes while it is not
observed through the replication stream (for example, a column is added or
dropped in PostgreSQL and then the server restarts), the consumer's
`StorageData` constructor throws `LOGICAL_ERROR: Columns number mismatch`. This
exception was thrown while building the consumer for the whole database, so a
single out-of-sync table stopped replication of *all* tables, and the database
could not recover without being dropped and recreated.

Catch the error per table when constructing the consumer: the affected table is
skipped (with a clear message asking to DETACH/ATTACH it) while the rest of the
tables keep replicating. This mirrors how structure changes detected through the
replication stream are already handled.

Closes: #66273
Related: #49045

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 13, 2026
Comment thread src/Storages/PostgreSQL/MaterializedPostgreSQLConsumer.cpp Outdated
… the catch

The "Columns number mismatch" error is an expected condition (the PostgreSQL
table structure no longer matches the nested ClickHouse table), not an internal
logic error. Throwing it as `LOGICAL_ERROR` aborted the server in debug/sanitizer
builds at exception-construction time, before the catch in the consumer
constructor could skip the table — which is why the integration test crashed the
server in CI.

Throw `POSTGRESQL_REPLICATION_INTERNAL_ERROR` instead, and narrow the catch in
the consumer constructor to that specific error so unrelated initialization
failures still propagate (as suggested in review).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@alexey-milovidov alexey-milovidov self-assigned this Jun 14, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 14, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to a conflict with the base branch Jun 14, 2026

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 14, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to a conflict with the base branch Jun 14, 2026
@kssenii kssenii self-assigned this Jun 15, 2026
alexey-milovidov and others added 3 commits June 15, 2026 15:20
Resolve conflict in tests/integration/test_postgresql_replica_database_engine/test_3.py:
both branches appended new test functions at the same location. Keep all three —
this PR's test_table_schema_changed_while_server_down regression test plus master's
test_numeric_to_int256 and test_numeric_int256_validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Re-merged the latest master into the branch (31efaf43ffb..4673641859c) to refresh CI. The merge was clean with no conflicts, and the net diff is unchanged — only src/Storages/PostgreSQL/MaterializedPostgreSQLConsumer.cpp and tests/integration/test_postgresql_replica_database_engine/test_3.py (byte-faithful to the prior tip).

The only red check was Stress test (arm_debug) failing with Logical error: 'Bad cast from type DB::ColumnString to DB::ColumnLowCardinality' (STID: 4525-5f05). This is a pre-existing, unrelated stress/fuzzer flake, not caused by this PR:

  • It fires on master directly (pull_request_number = 0) and across many unrelated PRs over the last 90 days (2026-03-29/30/31, 2026-05-12, 2026-06-13, 2026-06-16), not only here. Since STID is a stack-trace fingerprint, those older occurrences are the same code location, so it long predates this change.
  • The exception is a ColumnStringColumnLowCardinality bad cast with no relation to MaterializedPostgreSQL; this PR touches only the PostgreSQL replication consumer.
  • It belongs to the auto-filed "Bad cast from type A to B" fuzzer family (testing, fuzz), cf. Logical error: Bad cast from type A to B (STID: 4054-600e) #107598 and Logical error: Bad cast from type A to B (STID: 2241-3a03) #103814.

AI Review verdict is ✅ Approve, @kssenii approved, and there are no unresolved review threads.

alexey-milovidov and others added 3 commits June 24, 2026 16:25
…smatch-restart

# Conflicts:
#	tests/integration/test_postgresql_replica_database_engine/test_3.py
…test_3.py

The 06-24 merge of `master` into this branch rewrote the import block of
`tests/integration/test_postgresql_replica_database_engine/test_3.py`
(master had refactored those imports) and dropped the
`assert_number_of_columns` import, while the test body added by this PR
still calls it. `ruff` failed with:

    test_3.py:1072:5: F821 Undefined name `assert_number_of_columns`

Re-add the import. `assert_number_of_columns` is defined in
`tests/integration/helpers/postgres_utility.py`.

Style check report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107427&sha=de16eaaf5e360e25dffab6b674e458bee1c8c7b7&name_0=PR&name_1=Style%20check

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Storages/PostgreSQL/MaterializedPostgreSQLConsumer.cpp
…recovery

When a table is skipped from replication (for example, skipped at startup
because its structure no longer matched the nested ClickHouse table, as added
in this PR, or skipped after a structure change observed in the replication
stream), and then receives WAL before the user repairs it, the consumer
processes the table's `Relation` message, finds no storage for it, and calls
`markTableAsSkipped`, which records an empty `skip_list` entry keyed by the
PostgreSQL relation id.

The promised `DETACH`/`ATTACH` recovery went through `removeNested`/`addNested`,
neither of which cleared that `skip_list` entry. After the `waiting_list` entry
set by `addNested` was consumed by the first post-attach message, `isSyncAllowed`
saw the stale empty `skip_list` entry and returned `false`, so all further
changes to the reattached table were skipped indefinitely and replication never
resumed.

Clear `skip_list` entries whose relation id maps to the table being brought back
in `addNested`, so the recovery actually works.

The regression test now writes to the affected table before `DETACH`/`ATTACH`
(waiting until it is really marked as skipped in the replication stream) and
again after `ATTACH`, proving that replication resumes instead of only that the
snapshot reload matches once.
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Addressed the AI Review "Request changes" finding (stale skip_list entry on DETACH/ATTACH recovery).

Fix. When a startup-skipped table later receives WAL, the consumer's Relation handler finds no storage for it and markTableAsSkipped records an empty skip_list entry keyed by the PostgreSQL relation id. The DETACH/ATTACH recovery went through removeNested/addNested, neither of which cleared that entry, so once the waiting_list entry set by addNested was consumed, isSyncAllowed saw the stale empty skip_list entry and skipped all further changes to the reattached table indefinitely. addNested now drops any skip_list entry whose relation id maps (via relation_id_to_name) to the table being brought back, so the recovery actually resumes replication. This also fixes the same latent issue for tables skipped after a structure change observed in the replication stream.

Test. The regression test now writes to the affected table before DETACH/ATTACH (waiting on the is skipped from replication stream log line so the stale-skip_list path is really exercised, not raced) and again after ATTACH, asserting the table re-synchronizes — which fails without the fix and passes with it.

The modified MaterializedPostgreSQLConsumer.cpp translation unit compiles cleanly. Latest master was merged into the branch (clean, no conflicts).

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged the latest master into the branch to refresh CI (809e8fb9ae1..449c5fcbe36). The merge was clean with no conflicts, and the net diff is unchanged — only src/Storages/PostgreSQL/MaterializedPostgreSQLConsumer.cpp and tests/integration/test_postgresql_replica_database_engine/test_3.py (+111/-3). The changed translation unit compiles cleanly (-fsyntax-only). AI Review is ✅ Approve, @kssenii approved, and there are no unresolved review threads.

The three previously red checks are all unrelated to this change:

  • Stress test (amd_tsan) and Stress test (arm_tsan) — both Hung check failed, possible deadlock found (STID for the generic TCPHandler/executeQuery path), tracked as the chronic master-wide flake Hung check failed, possible deadlock found #107941. The stack trace has no relation to MaterializedPostgreSQL.
  • CH Inc sync (tests failed, clickhouse-private#60882) — the only failing private job was Integration tests (amd_asan_ubsan, db disk, old analyzer, 5/7), and its single failing test was test_refreshable_mv_no_multi_read/test.py::test_refreshable_mv_attach_without_multi_read (700 passed, 1 failed). That test is unrelated to this PR (refreshable MV / Keeper MULTI_READ), and its flakiness fix cdabef4635c ("Wait for the scheduled pass before asserting view state in the no-multi-read test", Stop refreshable MV gracefully on Keeper without MULTI_READ on ATTACH/restore #108441) was not in the synced commit 809e8fb9 but is included by this master merge. The push re-syncs the private PR to a commit that contains the fix and re-runs the sync from scratch. This PR's own test (test_postgresql_replica_database_engine/test_3.py) passed.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged the latest master into the branch to refresh CI (449c5fcbe36..2325d95ef20). The merge was clean with no conflicts, and the net diff is unchanged — only src/Storages/PostgreSQL/MaterializedPostgreSQLConsumer.cpp and tests/integration/test_postgresql_replica_database_engine/test_3.py (+111/-3, byte-identical to the prior tip 449c5fcbe36).

The two previously red checks are both the same unrelated, chronic master-wide flake:

  • Stress test (amd_tsan) and Stress test (arm_tsan) — both Hung check failed, possible deadlock found, tracked as Hung check failed, possible deadlock found #107941 (open). The stack trace is the generic AST-fuzzer path (executeASTFuzzerQueriesTCPHandler::runImplexecuteQuery) with no relation to MaterializedPostgreSQL. CIDB confirms this fingerprint fires on master directly (pull_request_number = 0) every single day and across 150+ unrelated PRs per day, so it long predates and is independent of this change.

This PR's own integration test (test_postgresql_replica_database_engine/test_3.py::test_table_schema_changed_while_server_down) passed. AI Review is ✅ Approve, @kssenii approved, and there are no unresolved review threads.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The latest CI cycle (commit 2325d95e) is still mostly in progress (stateless, integration, stress, AST fuzzer and BuzzHouse jobs are queued). The only completed red is Unit tests (tsan, function_prop_fuzzer) (FunctionsStress.stress / AllTests), and it is unrelated to this change.

  • The failure is the function property fuzzer flagging reinterpret of a Decimal/DateTime64 value to the same physical type with a different scale — e.g. SELECT reinterpret(CAST('1970-01-01 00:59:59.87239' AS DateTime64(5)), CAST('DateTime64(1)' AS String))function returned column of unexpected type or scale: Const(DateTime64) instead of DateTime64(1). In the FromType == ToType branch of reinterpretAs.cpp the source column is returned verbatim, so its physical scale diverges from the declared result type.
  • It started firing reliably only after the new scale-aware fuzzer gate d3202d63ef8 ("Fix function property stress test scale-blind validity gate and changeDate result scale", writeSlice LOGICAL_ERROR: arrayPushFront/arrayPushBack over Array(DateTime64/Time64) reaches GatherUtils generic path with mismatched column types #108517), which now rejects scale-divergent columns at the producing function's gate (the same hardening that earlier caught and fixed changeYear/changeMonth/…).
  • CIDB confirms this is master-wide and not specific to this PR: today (2026-06-30) FunctionsStress.stress failed 23 times across 20 PRs plus master directly (pull_request_number = 0), all on the same reinterpret-scale family of queries. This PR touches only src/Storages/PostgreSQL/MaterializedPostgreSQLConsumer.cpp and tests/integration/test_postgresql_replica_database_engine/test_3.py, with no relation to reinterpret.
  • A fix is already open: Fix reinterpret to Decimal/DateTime64 with a different scale #108878 ("Fix reinterpret to Decimal/DateTime64 with a different scale"), which rebuilds the result column with the destination scale (IsDataTypeDecimal<ToType> covers Decimal, DateTime64 and Time64). Tracked under the auto-filed fuzzer issue FunctionsStress #107242.

Not re-merging master (the branch is only ~6 commits / a few hours behind) — that would just cancel the in-progress CI and re-roll the same master-wide flake. AI Review is ✅ Approve, @kssenii approved, and there are no unresolved review threads.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The latest completed red is now Upgrade check (amd_release) (Error message in clickhouse-server.log), and it is unrelated to this change — it is a known master-wide flake with a fix already in flight.

The five flagged lines all come from a single stateless test (04210_show_remote_databases_in_system_tables) that intentionally creates two remote database engines pointing at the unreachable documentation address 192.0.2.1:

<Error> mysqlxx::Pool: Failed to connect to MySQL (fake_db@192.0.2.1:3306 as user user): mysqlxx::ConnectionFailed
<Error> Application: Connection to mysql failed 1 times
<Error> DatabaseMySQL: ... Connections to mysql failed: fake_db@192.0.2.1:3306 ... (ALL_CONNECTION_TRIES_FAILED)
<Error> PostgreSQLConnectionPool: Connection error: connection to server at "192.0.2.1", port 5432 failed: timeout expired
<Error> void DB::DatabasePostgreSQL::removeOutdatedTables(): ... Connection to `192.0.2.1:5432` failed ...

When the server is restarted during the upgrade test, those still-attached DatabaseMySQL/DatabasePostgreSQL engines re-probe the unreachable host and log <Error> connection failures, which the upgrade-check allowlist does not yet cover. There is no real MySQL/PostgreSQL server in the upgrade environment, so these connection failures are expected and benign; none of them touch MaterializedPostgreSQL, and this PR changes only src/Storages/PostgreSQL/MaterializedPostgreSQLConsumer.cpp.

CIDB confirms it is a broad master-wide flake, not specific to this PR: Error message in clickhouse-server.log failed with the 192.0.2.1/fake_db messages across 8 PRs on 2026-06-30, 12 on 06-29, 6 on 06-28, 5 on 06-27, 9 on 06-26, starting 2026-06-25 — independently of this change. The fix is already open: #108560 (allowlist the remote-database-engine connection errors in upgrade_runner.sh).

The other completed reds are the same unrelated flakes noted above:

@kssenii approved, AI Review is ✅ Approve, and there are no unresolved review threads.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr status — re-merged current master.

The branch had drifted 157 commits behind master (merge-base from 07:03Z this morning), so I merged origin/master into it (2325d95ef20..4aa7c094b04). The merge is clean — master did not touch either PR file, and the net diff vs master is still exactly the two PR files (MaterializedPostgreSQLConsumer.cpp + test_3.py, +111/-3, byte-identical to before). The changed TU compiles clean (-fsyntax-only) against the newer headers. CI has been re-fired.

Review state unchanged: @kssenii APPROVED, AI Review ✅ Approve, 0 unresolved review threads.

All CI reds remain the same tracked, master-wide flakes — none touches the PostgreSQL replication path, and re-merging cannot clear them until their fixers land:

Blocked solely on those fixers landing + a maintainer merge.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr status — no code change, blocked on CI completion + merge.

Re-checked the post-merge head 4aa7c094b04 (today's master, merge-base 75154b382fe from 15:52Z, ~4h old). No re-merge: the branch already has today's master, the CI cycle for this head is still actively running (48 jobs in progress, including the Integration tests shards that run this PR's own test_postgresql_replica_database_engine/test_3.py), and re-merging would only cancel it and re-roll the same flake.

Review state: AI Review ✅ Approve (re-run on this head at 19:39Z), @kssenii APPROVED, 0 unresolved review threads (2/2 resolved). POSTGRESQL_REPLICATION_INTERNAL_ERROR (619) is registered in ErrorCodes.cpp and used correctly.

Independent correctness check of the trickiest path (stale skip_list recovery): a table skipped at startup that later receives WAL gets an empty-LSN skip_list entry via markTableAsSkipped. Without clearing it, once the waiting_list entry set by addNested is consumed (one true from isSyncAllowed), the next message falls through to the stale skip_list entry and isSyncAllowed returns false indefinitely. The fix in addNested drops the entry keyed (via relation_id_to_name) by the reattached table, making DETACH/ATTACH recovery behave identically to a fresh addNested (waiting_list-only). The test waits on the is skipped from replication stream log line before DETACH/ATTACH, so it deterministically exercises that path rather than racing it. Confirmed sound.

Only completed reds are the two Stress test (amd_tsan) / Stress test (arm_tsan) Hung check failed, possible deadlock found — the chronic master-wide flake #107941 (OPEN). Their stack traces are the generic query-execution path (PipelineExecutor -> ExecutionThreadContext -> Poco::Thread) with no MaterializedPostgreSQL/PostgreSQL frames, so they are unrelated to this change. They will need a rerun once the in-progress cycle finishes (a rerun now is rejected while the run is queued/in-progress).

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Re-ran the only two red jobs on 4aa7c094Stress test (amd_tsan) and Stress test (arm_tsan) (workflow run 28461893177, --failed); both are queued again.

Both failures were Hung check failed, possible deadlock found with generic Poco::ThreadImpl::runnableEntry / __tsan_thread_start_func stacks (no MaterializedPostgreSQL / PostgreSQL frames) — the chronic master-wide Hung-check flake #107941, not caused by this change.

Status otherwise: net diff still exactly the 2 PR files (MaterializedPostgreSQLConsumer.cpp +50, test_3.py +64; +111/-3); AI Review ✅ Approve, approved by @kssenii, 0 unresolved threads; MERGEABLE. CH Inc sync is still in progress (private #60882). Ready to merge once the Stress rerun is green.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr status — no code change, re-triggered the flaky Stress reruns.

The previous rerun (workflow run 28461893177) completed and both jobs failed again with the same Hung check failed, possible deadlock found — the chronic master-wide flake #107941 (which the CI bot itself links). The stacks are the generic AST-fuzzer path (executeASTFuzzerQueriesTCPHandler::runImplexecuteQuery) with no MaterializedPostgreSQL/PostgreSQL frames, so they are unrelated to this change.

This is TSan-specific: on the same head 4aa7c094 all 9 other Stress variants are greenamd_debug, amd_asan_ubsan, amd_msan, arm_debug, arm_asan_ubsan, arm_asan_ubsan (s3), arm_msan, arm_ubsan, arm_release — only amd_tsan + arm_tsan hit the Hung check.

Re-ran only those two jobs again (gh run rerun 28461893177 --failed, now queued). No re-merge: the merge-base (75154b382fe, 06-30 15:52Z) is ~11h old and the branch is MERGEABLE; re-merging would only cancel the in-progress CH Inc sync and re-roll the same flake.

Net diff unchanged — exactly the two PR files (MaterializedPostgreSQLConsumer.cpp +50, test_3.py +64; +111/-3). AI Review ✅ Approve, @kssenii APPROVED, 0 unresolved review threads. Blocked solely on a green Stress rerun (or a merge-past the tracked flake) + maintainer merge.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr status — no code change; reruns resolved, CH Inc sync now green, ready to merge past the tracked flake.

The two Stress reruns queued in the previous pass (workflow run 28461893177, --failed) have completed (~03:29–03:30Z) and failed again with the same Hung check failed, possible deadlock found — the chronic master-wide flake #107941 (still OPEN, CI bot itself links it). The stacks are the generic Poco::ThreadImpl::runnableEntry / AST-fuzzer analyzer path (QueryAnalyzer::resolveFunctionresolveExpressionNodeIExecutableFunction::execute) with no MaterializedPostgreSQL/PostgreSQL frames, so unrelated to this change. On the same head all other Stress variants are green — only amd_tsan + arm_tsan hit the Hung check.

What changed since the last update:

  • CH Inc sync is now ✅ SUCCESS (was in progress on private Logical error: 'No responses set for MultiRead'. #60882).
  • Mergeable Check is ✅ SUCCESS — every required check is green; the PR is MERGEABLE/UNSTABLE, i.e. the only red (Stress test (amd_tsan)/(arm_tsan)) is non-required.

No re-merge: the merge-base (75154b382fe, 06-30 15:52Z) is <1 day old, the branch is MERGEABLE, and master has touched neither PR file since — re-merging would only cancel the freshly-green CH Inc sync and re-roll the same flake. Head unchanged 4aa7c094; net diff still exactly the two PR files (MaterializedPostgreSQLConsumer.cpp +47/-3, test_3.py +64; +111/-3).

AI Review ✅ Approve, @kssenii APPROVED, 0 unresolved review threads. Blocked solely on a maintainer merge — the remaining red is the non-required tracked flake, so this can be merged past it.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, fix failures of the Hung check.

@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@alexey-milovidov Confirmed, and the fix is already merged: it is your own #108892.

Both Stress test (amd_tsan) and (arm_tsan) on 4aa7c094 have exactly one hung query each, identical: SELECT formatQuery($$SELECT [[[[ ... ~13k-deep nested array ... ]]]]$$) (40 KB literal), elapsed 1524s, is_cancelled 1, read_rows 0, no MaterializedPostgreSQL/PostgreSQL frames. Not caused by this PR (agreed), but it is the deep-nested-literal parse hang specifically, not a generic AST-fuzzer stack.

That is exactly what #108892 ("Parse nested literal collections in linear time") fixes: ParserCollectionOfLiterals::parseImpl was quadratic in nesting depth, so a ~13k-deep [[[...]]] took ~25 min to parse and the parser never checks cancellation mid-loop (hence is_cancelled 1 while still running).

The catch: this head predates the fix. #108892 merged 06-30 19:20Z; 4aa7c094 was cut 06-30 17:00Z (merge-base 75154b382fe, 15:52Z), so #108892 is not an ancestor of it (git compare = diverged). Every rerun --failed rebuilds the same pre-#108892 head, which is why it keeps hitting the hang.

Recommendation: merge past the non-required Hung check. #108892 is already in master, so post-merge master will not regress on this signature, and this avoids re-rolling the now-green CH Inc sync (your earlier concern). The master Stress-tsan Hung-check rate already collapsed from 48 on 06-30 to 1 on 07-01, right at the #108892 merge.

Alternative, if you want a green Stress run on record: re-merge master. Unlike the previous reruns, that now picks up #108892 and will clear it. Defense-in-depth for the same class (cap the recursion-controlling parser settings + honor cancellation) is in flight in #108863, tracked under #107941.

Reports:

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr — re-merged master to clear the tracked Stress Hung-check flake.

Following @groeneai's analysis: the only two red checks (Stress test (amd_tsan) and (arm_tsan), both Hung check failed, possible deadlock found) were the deep-nested-literal parse hang, not caused by this PR, and already fixed on master by #108892 ("Parse nested literal collections in linear time", merged 06-30 19:20Z). The previous head 4aa7c094 (cut 17:00Z, merge-base 15:52Z) predated that fix, so every rerun --failed kept rebuilding the pre-fix binary and re-hitting the hang.

Took the recommended "green run on record" path: merged origin/master into the branch (4aa7c094b04..d33693394c1). The merge is cleanmaster did not touch either PR file, and the net diff vs master is unchanged (only src/Storages/PostgreSQL/MaterializedPostgreSQLConsumer.cpp +50 and tests/integration/test_postgresql_replica_database_engine/test_3.py +64). The new head now contains #108892 (c09dfca7068), so the fresh Stress-tsan cycle should clear this signature.

No code change. CH Inc sync, Mergeable Check, and Style check were already green; reviewDecision is still empty (awaiting a human approval). Blocked on fresh CI + human approve + merge.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr — the re-merge worked: the Stress Hung check flake is cleared, all Stress variants are green.

Confirming the outcome of the previous re-merge (d3369339, which pulled in #108892). On this head all 12 Stress variants have now completed ✅ SUCCESS — including Stress test (amd_tsan) and Stress test (arm_tsan), which were the only red across the last ~10 CI cycles (Hung check failed, possible deadlock found, the master-wide flake #107941).

This confirms @groeneai's diagnosis: the Hung check was the quadratic nested-literal parse, fixed on master by #108892 (c09dfca7068). The prior head 4aa7c094 predated that fix, so every gh run rerun --failed rebuilt the pre-fix binary and re-hung — which is why the earlier reruns kept failing on the same two TSan jobs.

Current CI on d3369339: 0 failures. Style check ✅ SUCCESS; CH Inc sync in progress (private). The remaining 32 jobs (17 Stateless, 8 Integration, 6 Performance Comparison, 1 SQLLogic) are still routinely in progress. @kssenii APPROVED, AI Review ✅ Approve, 0 unresolved review threads; the net diff is still exactly the two PR files (MaterializedPostgreSQLConsumer.cpp +47/-3, test_3.py +64; +111/-3).

Ready to merge once the remaining routine shards finish green.

@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered: 16/29 (55.17%) · Uncovered code

Full report · Diff report

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@alexey-milovidov alexey-milovidov merged commit b00c7d7 into master Jul 1, 2026
173 of 174 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-matpg-column-mismatch-restart branch July 1, 2026 21:05
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 1, 2026
alexey-milovidov added a commit that referenced this pull request Jul 2, 2026
Resolve conflict in test_postgresql_replica_database_engine/test_3.py: both
this PR (backup tests: test_backup_database, test_backup_table_engine,
test_backup_table_partitions) and master's PR #107427
(test_table_schema_changed_while_server_down) added new test functions at the
same location. Keep both sets. The three C++ files auto-merged cleanly; their
changes occupy disjoint regions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MaterializedPostgreSQL connection breaks on column changes in postgres after restart

4 participants