Fix exception when Iceberg format version is upgraded by external tool by alexey-milovidov · Pull Request #100407 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix exception when Iceberg format version is upgraded by external tool#100407

Merged
alexey-milovidov merged 72 commits into
masterfrom
fix-iceberg-format-version-upgrade
Jun 13, 2026
Merged

Fix exception when Iceberg format version is upgraded by external tool#100407
alexey-milovidov merged 72 commits into
masterfrom
fix-iceberg-format-version-upgrade

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Mar 23, 2026

Copy link
Copy Markdown
Member

When external tools like Spark upgrade the Iceberg format version (e.g. from v1 to v2), the format_version cached in PersistentTableComponents becomes stale. The previous chassert in getState and getHistory then triggered a logical error because the cached value no longer matched the metadata file.

Instead of mutating shared cached state from const methods (which would introduce a cross-query data race), this PR derives the parsing format version from each manifest list / manifest file's own Avro format-version metadata. A v2 table can therefore still reference v1 manifests left behind by an external upgrade, and each one is parsed correctly.

For backward compatibility with older ClickHouse releases that wrote manifests without the format-version Avro key, falls back to schema-based detection: the sequence_number field appears at the top level of v2 manifest lists / manifest entries, but is absent from their v1 counterparts. New manifests written by ClickHouse now also include the format-version Avro metadata key.

The cached format_version in PersistentTableComponents remains const (per-table value observed when the table was opened) and is no longer compared against the latest metadata file.

Fixes #86776

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix logical error exception when reading Iceberg tables whose format version was upgraded by an external tool (e.g. Spark).

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Note

Medium Risk
Changes Iceberg manifest list/file parsing and remove_orphan_files gating to stop relying on cached table format_version, which could affect compatibility across v1/v2 manifests if the new per-file detection is wrong.

Overview
Fixes failures when an Iceberg table’s format-version is upgraded externally (e.g. Spark v1→v2) by stopping reliance on the cached PersistentTableComponents::format_version for parsing and instead deriving the version from each manifest list / manifest file’s Avro metadata.

AvroForIcebergDeserializer::getFormatVersionFromManifestFileMetadata() now falls back to schema-based detection (presence of top-level sequence_number) when the format-version Avro key is missing, and writers now emit format-version metadata for new manifest lists/files. Assertions that the cached version matches the latest metadata are removed, ManifestFileIterator::create no longer takes a table-wide version, and EXECUTE remove_orphan_files re-reads the latest metadata to enforce its v2-only requirement.

Adds an integration test covering external upgrades, concurrent reads during upgrade, and the no-format-version fallback behavior.

Reviewed by Cursor Bugbot for commit 12aee57. Bugbot is set up for automated code reviews on this repo. Configure here.

Version info

  • Merged into: 26.6.1.741

When external tools like Spark upgrade the Iceberg format version (e.g.
from v1 to v2), the `format_version` cached in `PersistentTableComponents`
becomes stale. The `chassert` in `getState` and `getHistory` would then
trigger a logical error because the cached value no longer matches the
metadata file.

Replace the assertions with updates to the cached `format_version`, so
that downstream code (manifest parsing, etc.) always uses the correct
version. The field is changed from `const` to `mutable` to allow
updating it from const methods.

Fixes: #86776

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! labels Mar 23, 2026
@alexey-milovidov alexey-milovidov added pr-bugfix Pull request with bugfix, not backported by default and removed pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-critical-bugfix labels Mar 23, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member Author

The field is changed from const to mutable to allow updating it from const methods.

Thread safety?

Comment thread src/Storages/ObjectStorage/DataLakes/Iceberg/PersistentTableComponents.h Outdated
alexey-milovidov and others added 5 commits March 24, 2026 00:30
… `std::atomic`

The `format_version` field can be updated when external tools upgrade
the Iceberg format version, but can also be read concurrently from
multiple queries. Using `mutable Int32` without synchronization is a
data race (undefined behavior). Replace with `std::atomic<Int32>`.

Since `std::atomic` is neither copyable nor movable, add explicit
constructors for `PersistentTableComponents`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…version-upgrade

# Conflicts:
#	src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergMetadata.cpp
#	src/Storages/ObjectStorage/DataLakes/Iceberg/PersistentTableComponents.h
Adds an integration test that verifies ClickHouse can read an Iceberg
table after its `format-version` has been upgraded from v1 to v2 by an
external tool (simulated by directly modifying the metadata JSON).
This is a regression test for #86776

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergMetadata.cpp Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member Author

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The flaky check failure is fixed in #102148, let's update the branch.

alexey-milovidov and others added 8 commits April 9, 2026 20:49
When an external tool (e.g. Spark) upgrades the Iceberg format version
from v1 to v2, existing manifest files are left in v1 format (per the
Iceberg specification, their entries are interpreted as having
`sequence_number = 0`).  `ManifestFileIterator::create` was checking the
table's cached `format_version`, so it rejected these legacy manifests
with an `ICEBERG_SPECIFICATION_VIOLATION` exception claiming that the
required `sequence_number` column was missing.

Use the manifest file's own Avro metadata to determine how to parse it,
matching what `AvroForIcebergDeserializer::createParsedManifestFileEntry`
already does for entry-level decoding.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100407&sha=1299abf41df6d00d2744ca5a67931bf136c0aa15&name_0=PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback: `PersistentTableComponents` is shared across
concurrent queries, so mutating `format_version` (even via `std::atomic`)
introduced a cross-query correctness hazard — query A could observe the
`format-version` written by query B while still parsing A's own metadata
structures.

Revert the field to `const Int32` and drop the assignments in `getState`
and `getHistory`. Parsing now relies on the format version recorded in
each manifest list / manifest file's own Avro metadata, which
`getFormatVersionFromManifestFileMetadata` already returns: the previous
commit already wired this into `ManifestFileIterator::create`, and this
commit extends the same approach to `getManifestList` so that a v1
manifest list left over after a Spark v1 -> v2 upgrade is not read with
v2 rules.
The previous commits moved away from mutating the cached `format_version`
in `PersistentTableComponents` toward reading the format version from each
manifest list / manifest file's own Avro metadata. Update the test's
docstring accordingly so the intent matches the current implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 7 commits May 20, 2026 16:58
…tion

The OAuth token used to push from this environment lacks the `workflow`
scope, so any commit that adds or modifies a file under `.github/workflows/`
is rejected on push. Removing the master-only `optimize_clickhouse.yml` from
the merge commit restores the workflow tree to what was already in the PR.
The prior commit deleted/modified workflow files in this branch as a
workaround for GitHub's workflow-scope check. Restoring them to match
`master` so the PR diff no longer contains unrelated changes.
The `test_format_version_upgrade_concurrent_reads` test submits infinite
reader loops to a `ThreadPoolExecutor`, then sets `stop["flag"]` only after
the main-thread upgrade path succeeds. If `_read_iceberg_metadata`, the
`format-version` assertion, or `_write_iceberg_metadata` raised, the flag was
never set, so the `ThreadPoolExecutor` shutdown on leaving the `with` block
waited forever on the reader loops and the test hung until the global CI
timeout instead of failing fast.

Set `stop["flag"]` in a `finally` so the readers always stop. Addresses the
AI Review finding reported by the `CH Inc sync` job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 87365837b95: in test_format_version_upgrade_concurrent_reads, the infinite reader loops are submitted to a ThreadPoolExecutor, but stop["flag"] was set only after the main-thread upgrade path (_read_iceberg_metadata / the format-version assertion / _write_iceberg_metadata) succeeded. If any of those raised, the flag was never set and the ThreadPoolExecutor shutdown on leaving the with block would wait forever on the reader loops, hanging the test until the global CI timeout instead of failing fast. The flag is now set in a finally. This addresses the actionable AI-Review finding surfaced by the CH Inc sync job.

The other failures in the latest sync run (clickhouse-private#53314) are not caused by this PR:

  • Integration tests (..., db disk, 4/6)test_storage_kafka/test_batch_fast.py::test_kafka_engine_put_errors_to_stream_with_random_malformed_json flaky AssertionError: assert 54 == 64 (Kafka, untouched by this PR).
  • Integration tests (..., db disk, old analyzer, 2/6)The self-hosted runner lost communication with the server (infra).
  • The second AI-Review finding (src/IO/S3/Credentials.cpp web-identity resolution) is a pre-existing master finding on code this PR does not touch.
  • Stateless tests (..., meta in keeper, s3 storage, ...) — recurring sync-side flakiness as noted above.

@groeneai, the remaining CH Inc sync failures above are unrelated to this PR (flaky Kafka test, lost-runner infra error, pre-existing S3 web-identity AI-Review finding, and meta-in-keeper/s3 stateless flakiness). Please investigate and provide a fix in a separate PR if one is needed; link it here if already in progress.

@groeneai

groeneai commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

@alexey-milovidov, investigated. Findings below, sourced from CIDB (default.checks) and our CI tracker.

Class 1. test_kafka_engine_put_errors_to_stream_with_random_malformed_json (assert 54 == 64, Integration tests db disk 4/6): chronic sanitizer-only flake. CIDB 14d shows 8 hits across amd_tsan, amd_asan_ubsan db disk old analyzer, amd_msan, amd_llvm_coverage (4 master + 4 PR). Pre PR #105138 (merged 2026-05-17) the weekly peak was 18/wk; post merge it dropped to ≤4/wk and only on sanitizers. Both 54 == 64 (10 error rows missing) and 0 == 64 (whole _errors MV empty) shapes appear only on asan/msan/tsan. Already tracked under our umbrella for #104905. No fix on this PR; rerun the job.

Class 2. "self-hosted runner lost communication with the server" (Integration tests db disk old analyzer 2/6): one-off GitHub Actions runner heartbeat loss. CIDB 30d sweep returns 0 hits matching that string in test_context_raw, which confirms it is GHA-runner-level and never recorded as a test failure. Not actionable; rerun.

Class 3. AI-Review on src/IO/S3/Credentials.cpp:138 (web-identity settings merge): pre-existing master finding, unrelated. CIDB 30d shows 11 hits across 8 distinct PRs matching the S3/Credentials.cpp / web-identity pattern in test_context_raw, which is clear cross-PR signal on master code that this PR does not touch (it only modifies Iceberg files). Defer to S3 component owners (Kseniia / kssenii). Not a blocker for #100407; we can open a separate master issue if useful.

Class 4. "meta in keeper, s3 storage" stateless flakiness: sync-side label, not a public CIDB check_name. CIDB has no check_name containing "meta in keeper"; the closest public variants are Stateless tests (amd_tsan, s3 storage, ...), Stateless tests (amd_debug, distributed plan, s3 storage, ...), Stateless tests (amd_llvm_coverage, AsyncInsert / ParallelReplicas, s3 storage, ...). The specific failing test from sync run clickhouse-private#53314 is private-side and not visible to us. If the rerun surfaces it again, post the test name here and we'll cross-check our public umbrellas.

Bottom line: none of the four are caused by PR #100407. Class 1 is a known sanitizer-only chronic flake; Class 2 and Class 4 are infra / sync-side noise (rerun); Class 3 is a pre-existing master AI-Review finding scoped to the S3 component. No separate fix PR opened by us at this time. We will continue to monitor Class 1 and act if a master regression appears.

The actionable AI-Review finding addressed by your 87365837b95 (set stop["flag"] in a finally so ThreadPoolExecutor shuts down on exception) looks correct.

(Worker session: cron:clickhouse-ci-task-worker:20260605-030500)

@clickhouse-gh

clickhouse-gh Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

no significant changes detected. K_source=6 K_base=30 flagged=0/65

clickbench

🟢 No significant changes

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: f5d828d0-82d9-4b85-b805-24118a0f3d97
  • MIRAI run: 1237ba2e-b44e-46bf-987e-498f18eec4c3
  • PR check IDs:
    • clickbench_320239_1781234812
    • clickbench_320253_1781234812
    • clickbench_320259_1781234812
    • tpch_adapted_1_official_320266_1781234812
    • tpch_adapted_1_official_320277_1781234812
    • tpch_adapted_1_official_320339_1781234813

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged master (the branch was ~1072 commits behind) — clean merge, no conflicts. Rebuilt the 7 affected Iceberg translation units (AvroForIcebergDeserializer, IcebergIterator, IcebergMetadata, IcebergWrites, ManifestFileIterator, RemoveOrphanFilesExecute, StatelessMetadataFileGetter) against current master — all compile cleanly with no warnings, so none of the 1072 intervening commits broke the APIs this PR uses.

The single CI failure (Performance Comparison (arm_release, 1/6)) is unrelated noise: the flagged queries are rand, set_index, vectorize_aggregation_combinators, and insert_select_default_small_block — none touch Iceberg, and there is no plausible mechanism by which Iceberg manifest parsing could affect them.

The one open review thread (@scanhex12 on Iceberg v3) is already addressed in the discussion: v3 manifests always carry the format-version Avro key per the Iceberg spec and take the first branch; the schema-based fallback only applies to legacy manifests written by older ClickHouse, which only ever produced v1/v2.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged master again (the branch was 42 commits behind). Clean merge, zero conflicts; the net PR diff is unchanged (11 files, +382/-26) and the core getFormatVersionFromManifestFileMetadata logic is intact.

This pulls in #106664, which fixes the Virtual row boundary violated in MergingSortedAlgorithm logical error that was one of the two failures in Stateless tests (amd_debug, parallel) — unrelated to this PR's Iceberg changes. The other failure, Server died, is the chronic master-wide flaky verdict (the targeted Iceberg tests all pass).

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, investigate the failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100407&sha=eebc7dbc7f028d3d22fa69b51b5e95aff18a1f34&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20db%20disk%2C%20old%20analyzer%2C%202%2F6%29 and provide a fix in a separate PR. If the fix is already in progress, link it here.

test_lost_part/test.py::test_lost_part_other_replica failed with NO_REPLICA_HAS_PART (merge entry queue-0000000005 postponed 156 times with exponential backoff until the test timed out). This PR only touches Iceberg manifest parsing, so the failure is unrelated; CIDB shows the same test also failed on an unrelated branch on 2026-05-25, and I found no open issue tracking it.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged master into the branch (clean, no conflicts — commit 334b728). The only red in the last CI run was the unrelated test_lost_part_other_replica flake reported above; all Iceberg tests passed. The branch had fallen ~830 commits behind, and master's recent Iceberg changes (#105682, version-hint sync) touch no files modified by this PR.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov fix is up in a separate PR: #106984

Root cause (not related to this Iceberg PR, as you noted): test_lost_part_other_replica drains the replication queue with a fixed 10x1s loop after lost-part recovery. Recovery produces transiently-failing entries: GET_PART all_2_2_0 -> PART_IS_TEMPORARILY_LOCKED (the recovered part is Outdated and being cleaned up) and the dependent MERGE_PARTS all_0_4_1 -> NO_REPLICA_HAS_PART until the fetch lands. ReplicatedMergeTreeQueue::getPostponeTimeMsForEntry parks each failed entry for 1 << num_tries ms capped at max_postpone_time_for_failed_replicated_{fetches,merges}_ms (default 60000). In the failed run queue-0000000005 reached num_postponed=156 with a 29383ms backoff, well past the 10s window, so the queue was still non-empty and the test hit assert False at test.py:160.

Fix sets max_postpone_time_for_failed_replicated_fetches_ms=0 and max_postpone_time_for_failed_replicated_merges_ms=0 in test_lost_part_same_replica, test_lost_part_other_replica, and test_lost_part_mutation, mirroring the existing max_postpone_time_for_failed_mutations_ms=0. test_lost_last_part is left as is (tolerant 50s loop).

Reproduced deterministically with the replicated_queue_fail_next_entry failpoint (forcing num_tries>=14 leaves the queue non-empty after 10s without the setting, empty with it). Fixed suite: 12/12 (4 tests x 3 runs). The 2026-05-25 hit you found is a different failure mode (NETWORK_ERROR: Connection reset by peer during DROP TABLE, infra) and not addressed here.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged master into the branch again (clean, zero conflicts — commit ad565d4, the branch was ~107 commits behind). The net PR diff is unchanged (11 files, +382/−26), and all four affected Iceberg translation units (IcebergMetadata.cpp, IcebergWrites.cpp, AvroForIcebergDeserializer.cpp, RemoveOrphanFilesExecute.cpp) compile clean against the merged tree, including the two files master had touched in the meantime.

The sole red in the last run, Stateless tests (arm_asan_ubsan, azure, parallel) (6 failed tests), was a runner-wide memory exhaustion event, not related to this PR: at 01:54:24 the runner hit the total-memory ceiling (memory limit exceeded: would use 3.37 GiB ... current RSS: 53.72 GiB, maximum: 53.69 GiB on a trivial SELECT map('a', 1, 'b', 2) FROM remote('127.0.0.{1,2}') in 01787_map_remote), and the other five tests failed simultaneously with Connection reset by peer. All six passed on rerun (0/3 rerun failures), and none of them touch Iceberg code. CIDB shows this exact failure class spiked across the fleet on the same day: 544 failures in 82 PRs on 2026-06-10 and 367 in 29 PRs on 2026-06-09, versus single digits before.

@groeneai, investigate the failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100407&sha=334b728170416b5b0ce4ba250bfc2a09b149d2ac&name_0=PR&name_1=Stateless%20tests%20%28arm_asan_ubsan%2C%20azure%2C%20parallel%29 — the runner-wide total-memory exhaustion on stateless ASan runners (spiking 2026-06-09/10 across dozens of PRs) and provide a fix in a separate PR. If the fix is already in progress, link it here.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov Confirmed, and the root-cause fix is already merged.

Cited run (SHA 334b7281, Stateless tests (arm_asan_ubsan, azure, parallel)) - all 6 failures are graceful, no crash:

  • 01787_map_remote: Code 241, (total) memory limit exceeded ... current RSS: 53.72 GiB, maximum: 53.69 GiB on a trivial remote('127.0.0.{1,2}') query.
  • The other 5 (02317_distinct_in_order_optimization_explain, 02943_alter_user_modify_profiles_and_settings, 03315_query_log_privileges_backup_restore, 03815_debug_assertion_column_sizes_race, 04070_arrow_complex_types): Code 210 Connection reset by peer - the runner dropping connections at the RSS ceiling.
  • Crash-signature scan over all 6 test_context_raw: 0 SIGSEGV / 0 use-after-free / 0 allocator CHECK / 0 LOGICAL_ERROR. Genuine OOM + network cascade, not a masked crash. Unrelated to this PR's Iceberg manifest parsing.

CIDB corroborates the runner-wide spike on arm_asan_ubsan, azure, parallel (failure rows / distinct PRs per day): single digits through 06-07, then 634 / 40 PRs on 06-09 and 972 / 94 PRs on 06-10, back to 8 / 1 PR today (06-12).

Root-cause fix already merged: #106962 (Raul Marin, "ci: tag 01603_read_with_backoff_bug as no-asan to prevent cascading OOM", merged 06-11 16:38 UTC). 01603_read_with_backoff_bug was the heavy consumer driving the cascade on this lane (9 fails 06-09, 12 fails 06-10); it is now SKIPPED on the asan azure lane (76x on 06-11) and the lane's failure volume dropped 972 -> 170 -> 8. The directive's "if the fix is already in progress, link it here" is satisfied by #106962, so no separate PR is needed from my side.

This is the same azure runner-OOM family I analyzed earlier on #106187 and #101843.

@clickhouse-gh

clickhouse-gh Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.60% 84.60% +0.00%
Functions 92.40% 92.40% +0.00%
Branches 77.30% 77.30% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 55/96 (57.29%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 1 line(s) · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov merged commit 4e6d7ec into master Jun 13, 2026
165 of 166 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-iceberg-format-version-upgrade branch June 13, 2026 12:41
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logical error: 'persistent_components.format_version == metadata_object->getValue<int>(f_format_version)''

4 participants