MaterializedPostgreSQL: don't break all replication on one table's structure mismatch#107427
Conversation
…e's structure mismatch When the structure of a replicated PostgreSQL table changes while it is not observed through the replication stream (for example, a column is added or dropped in PostgreSQL and then the server restarts), the consumer's `StorageData` constructor throws `LOGICAL_ERROR: Columns number mismatch`. This exception was thrown while building the consumer for the whole database, so a single out-of-sync table stopped replication of *all* tables, and the database could not recover without being dropped and recreated. Catch the error per table when constructing the consumer: the affected table is skipped (with a clear message asking to DETACH/ATTACH it) while the rest of the tables keep replicating. This mirrors how structure changes detected through the replication stream are already handled. Closes: #66273 Related: #49045 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… the catch The "Columns number mismatch" error is an expected condition (the PostgreSQL table structure no longer matches the nested ClickHouse table), not an internal logic error. Throwing it as `LOGICAL_ERROR` aborted the server in debug/sanitizer builds at exception-construction time, before the catch in the consumer constructor could skip the table — which is why the integration test crashed the server in CI. Throw `POSTGRESQL_REPLICATION_INTERNAL_ERROR` instead, and narrow the catch in the consumer constructor to that specific error so unrelated initialization failures still propagate (as suggested in review). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve conflict in tests/integration/test_postgresql_replica_database_engine/test_3.py: both branches appended new test functions at the same location. Keep all three — this PR's test_table_schema_changed_while_server_down regression test plus master's test_numeric_to_int256 and test_numeric_int256_validation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Re-merged the latest The only red check was
AI Review verdict is ✅ Approve, @kssenii approved, and there are no unresolved review threads. |
…smatch-restart # Conflicts: # tests/integration/test_postgresql_replica_database_engine/test_3.py
…test_3.py
The 06-24 merge of `master` into this branch rewrote the import block of
`tests/integration/test_postgresql_replica_database_engine/test_3.py`
(master had refactored those imports) and dropped the
`assert_number_of_columns` import, while the test body added by this PR
still calls it. `ruff` failed with:
test_3.py:1072:5: F821 Undefined name `assert_number_of_columns`
Re-add the import. `assert_number_of_columns` is defined in
`tests/integration/helpers/postgres_utility.py`.
Style check report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107427&sha=de16eaaf5e360e25dffab6b674e458bee1c8c7b7&name_0=PR&name_1=Style%20check
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…recovery When a table is skipped from replication (for example, skipped at startup because its structure no longer matched the nested ClickHouse table, as added in this PR, or skipped after a structure change observed in the replication stream), and then receives WAL before the user repairs it, the consumer processes the table's `Relation` message, finds no storage for it, and calls `markTableAsSkipped`, which records an empty `skip_list` entry keyed by the PostgreSQL relation id. The promised `DETACH`/`ATTACH` recovery went through `removeNested`/`addNested`, neither of which cleared that `skip_list` entry. After the `waiting_list` entry set by `addNested` was consumed by the first post-attach message, `isSyncAllowed` saw the stale empty `skip_list` entry and returned `false`, so all further changes to the reattached table were skipped indefinitely and replication never resumed. Clear `skip_list` entries whose relation id maps to the table being brought back in `addNested`, so the recovery actually works. The regression test now writes to the affected table before `DETACH`/`ATTACH` (waiting until it is really marked as skipped in the replication stream) and again after `ATTACH`, proving that replication resumes instead of only that the snapshot reload matches once.
|
Addressed the AI Review "Request changes" finding (stale Fix. When a startup-skipped table later receives WAL, the consumer's Test. The regression test now writes to the affected table before The modified |
|
Merged the latest The three previously red checks are all unrelated to this change:
|
|
Merged the latest The two previously red checks are both the same unrelated, chronic master-wide flake:
This PR's own integration test ( |
|
The latest CI cycle (commit
Not re-merging |
|
The latest completed red is now The five flagged lines all come from a single stateless test ( When the server is restarted during the upgrade test, those still-attached CIDB confirms it is a broad master-wide flake, not specific to this PR: The other completed reds are the same unrelated flakes noted above:
@kssenii approved, AI Review is ✅ Approve, and there are no unresolved review threads. |
|
continue-pr status — re-merged current master. The branch had drifted 157 commits behind master (merge-base from Review state unchanged: @kssenii APPROVED, AI Review ✅ Approve, 0 unresolved review threads. All CI reds remain the same tracked, master-wide flakes — none touches the PostgreSQL replication path, and re-merging cannot clear them until their fixers land:
Blocked solely on those fixers landing + a maintainer merge. |
|
continue-pr status — no code change, blocked on CI completion + merge. Re-checked the post-merge head Review state: AI Review ✅ Approve (re-run on this head at Independent correctness check of the trickiest path (stale Only completed reds are the two |
|
Re-ran the only two red jobs on Both failures were Status otherwise: net diff still exactly the 2 PR files ( |
|
continue-pr status — no code change, re-triggered the flaky Stress reruns. The previous rerun (workflow run 28461893177) completed and both jobs failed again with the same This is TSan-specific: on the same head Re-ran only those two jobs again ( Net diff unchanged — exactly the two PR files ( |
|
continue-pr status — no code change; reruns resolved, The two Stress reruns queued in the previous pass (workflow run 28461893177, What changed since the last update:
No re-merge: the merge-base ( AI Review ✅ Approve, @kssenii APPROVED, 0 unresolved review threads. Blocked solely on a maintainer merge — the remaining red is the non-required tracked flake, so this can be merged past it. |
|
@groeneai, fix failures of the Hung check. |
|
@alexey-milovidov Confirmed, and the fix is already merged: it is your own #108892. Both That is exactly what #108892 ("Parse nested literal collections in linear time") fixes: The catch: this head predates the fix. #108892 merged 06-30 19:20Z; Recommendation: merge past the non-required Hung check. #108892 is already in master, so post-merge master will not regress on this signature, and this avoids re-rolling the now-green Alternative, if you want a green Stress run on record: re-merge master. Unlike the previous reruns, that now picks up #108892 and will clear it. Defense-in-depth for the same class (cap the recursion-controlling parser settings + honor cancellation) is in flight in #108863, tracked under #107941. Reports:
|
|
continue-pr — re-merged Following @groeneai's analysis: the only two red checks ( Took the recommended "green run on record" path: merged No code change. |
|
continue-pr — the re-merge worked: the Stress Confirming the outcome of the previous re-merge ( This confirms @groeneai's diagnosis: the Hung check was the quadratic nested-literal parse, fixed on Current CI on Ready to merge once the remaining routine shards finish green. |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 16/29 (55.17%) · Uncovered code |
Resolve conflict in test_postgresql_replica_database_engine/test_3.py: both this PR (backup tests: test_backup_database, test_backup_table_engine, test_backup_table_partitions) and master's PR #107427 (test_table_schema_changed_while_server_down) added new test functions at the same location. Keep both sets. The three C++ files auto-merged cleanly; their changes occupy disjoint regions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When a column is added to or dropped from a replicated PostgreSQL table while the change is not seen through the replication stream — e.g. the column changes and then the ClickHouse (or PostgreSQL) server restarts —
MaterializedPostgreSQLstopped replicating all tables in the database, with:Root cause. On startup the consumer builds a
StorageDatafor every table; its constructor compares the freshly fetched PostgreSQL column count with the existing nested ClickHouse table and throwsColumns number mismatchon a mismatch. That throw happened in the loop that constructs the consumer for the whole database, so one out-of-sync table aborted startup of the entire database, and it could not be recovered without dropping and recreating it.Fix. Catch the error per table while constructing the consumer. The affected table is skipped (with a clear log message asking the user to
DETACH/ATTACHit to reload its structure), while the other tables keep replicating. This mirrors how structure changes that arrive through the replication stream are already handled (the table is marked as skipped, not fatal).The added integration test changes the structure of one table in PostgreSQL, restarts ClickHouse, and verifies that (1) the other tables keep replicating and (2) the affected table is recoverable with
DETACH/ATTACH.Closes: #66273
Related: #49045
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix
MaterializedPostgreSQLstopping replication of an entire database (withLOGICAL_ERROR: Columns number mismatch) when a single replicated table's structure changed while the server was down. Now only the affected table is skipped and can be recovered withDETACH/ATTACH.Documentation entry for user-facing changes
🤖 Generated with Claude Code
Version info
26.7.1.382