Fix Iceberg REST catalog commit safety under post-commit network failure by il9ue · Pull Request #1982 · Altinity/ClickHouse · GitHub
Skip to content

Fix Iceberg REST catalog commit safety under post-commit network failure#1982

Closed
il9ue wants to merge 5 commits into
Altinity:antalya-26.3from
il9ue:fix/iceberg-commit-safety-26.3
Closed

Fix Iceberg REST catalog commit safety under post-commit network failure#1982
il9ue wants to merge 5 commits into
Altinity:antalya-26.3from
il9ue:fix/iceberg-commit-safety-26.3

Conversation

@il9ue

@il9ue il9ue commented Jun 29, 2026

Copy link
Copy Markdown

Found while working on #1609 (TRUNCATE storage-leak cleanup).

RestCatalog::updateMetadata returned a boolean result, which folded two outcomes that need different handling into one value:

  • the commit may have succeeded with the response lost, or
  • the catalog may have rejected it cleanly.

Both came back as false, so the caller ran cleanup in either case and deleted the manifest/data files the live snapshot still references. That's the same cleanup-on-failure pattern I'd otherwise be carrying into TRUNCATE.

  • INSERT corruption reproduces on 26.3.10. Mutations aren't reachable via SQL on these builds, and TRUNCATE on 26.3 stays safe only because of commit e229240.

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix possible data loss in Iceberg writes via REST catalog when a commit response is lost

Documentation entry for user-facing changes

When the network drops the response to a catalog commit that actually succeeded, ClickHouse retries the commit, sees the retry fail, and runs cleanup that deletes object-storage files the catalog at that point still references, leading to possible table corruption. The fix runs cleanup only when the commit was cleanly rejected; ambiguous outcomes leave the files in place.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

Tests

Two integration tests under tests/integration/test_storage_iceberg_no_spark/, both asserting one invariant: no catalog-referenced snapshot may point at a manifest list missing from object storage.

  • test_insert_commit_response_loss_is_handled : the fix, end to end. A fault proxy lets the commit POST reach the catalog (so it lands) then drops the response; ClickHouse retries and gets a 409. With the fix the INSERT completes, no snapshot loses its manifest list, and the row is readable. Without the fix this same test reproduces the corruption.
  • test_insert_without_fault_is_clean_baseline : control. With no fault armed the same INSERT must succeed cleanly, proving the corruption above comes from the injected response loss, not from the build or the harness.

Files:

  • test_catalog_commit_safety.py : the two tests above.
  • catalog_fault_proxy.py : a transparent reverse proxy in front of the REST catalog. Forwards everything by default; when armed, it forwards a commit POST upstream (so the catalog commits) then RSTs the client without returning the response, reproducing the "commit succeeded, response lost" window.
  • docker_compose_rest_proxy.yml : runs the proxy as a rest-proxy service on the cluster network and publishes its control port.
  • conftest.py : merges that compose file into the cluster so the proxy starts with the suite. ClickHouse talks to the catalog through the proxy; pyiceberg talks to the real catalog directly, so verification sees true committed state.

On INSERT via a REST catalog, a commit whose response is lost was retried,
got a 409, collapsed to a bool false, and triggered cleanup that deleted the
files the now-live snapshot referenced. Replace the bool with a typed
CommitOutcome and, on failure, re-read the catalog to classify whether our
snapshot actually landed before deciding to clean up.
@il9ue

il9ue commented Jun 29, 2026

Copy link
Copy Markdown
Author

@Altinity Altinity deleted a comment from chatgpt-codex-connector Bot Jul 2, 2026
@mkmkme

mkmkme commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b837ae8819

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"preserving written files to avoid corrupting a possibly-committed snapshot. "
"Orphaned files may need manual cleanup if the commit did not actually land.",
namespace_name, table_name);
return false;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Abort ambiguous insert commits instead of retrying

When CommitOutcome::Unknown means the first catalog POST may already have landed but the verification read failed, returning false sends control back to IcebergStorageSink::finalizeBuffers' retry loop. At this point generateNextMetadata has already mutated metadata and the writers still hold the same data files, so the next initializeMetadata attempt can append those same files again on top of the possibly-committed snapshot instead of surfacing the ambiguous result to the user.

Useful? React with 👍 / 👎.

"Iceberg mutation commit for {}.{} is of unknown status after a lost response; "
"preserving written files to avoid corrupting a possibly-committed snapshot.",
namespace_name, table_name);
return false;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Abort ambiguous mutation commits instead of retrying

For CommitOutcome::Unknown, the original mutation commit may have succeeded, but returning false makes the outer mutation loop retry from the latest catalog state. If the first commit did land, a retry can apply the same DELETE/UPDATE a second time (for example an UPDATE x = x + 1 can be applied again) rather than preserving files and failing the ambiguous operation, so this should not be treated like a clean conflict retry.

Useful? React with 👍 / 👎.

il9ue and others added 2 commits July 3, 2026 05:01
Follow-ups on the commit-safety fix.

Treat CommitOutcome::Unknown as a hard stop rather than a retryable false: a
lost-response commit may have already landed, so re-entering the transaction
retry loop would double-apply the write. Abort with DATALAKE_DATABASE_ERROR
(matching the retry-exhaustion convention) and preserve written files, guarding
the mutation cleanup path so the abort never deletes a possibly-committed
snapshot's data.

Also catch Poco::TimeoutException on the commit POST so a timed-out response is
classified through the same re-read path instead of leaking.

Adds an integration test that drives the Unknown path and asserts the write
aborts without double-applying and without deleting files.
@il9ue

il9ue commented Jul 3, 2026

Copy link
Copy Markdown
Author

Pushed a follow-up addressing the (chatgpt bot) review feedback.
The original fix classified a failed commit as Committed / RejectedCleanly / Unknown and cleaned up only on RejectedCleanly. The gap was Unknown : it also returned false, which the transaction retry loop reads as "retry", so a commit that may have already landed could be retried and double-applied.

This follow-up makes Unknown a hard stop:

  • Throws instead of returning false, so the operation aborts rather than retrying, and preserves the written files.
  • Same handling across the INSERT, import-partition, and mutation call sites.
  • The mutation cleanup path is guarded so the abort never deletes files.

The error code is DATALAKE_DATABASE_ERROR, matching what the retry loop already raises on exhaustion. It also catches Poco::TimeoutException on the commit POST so a timed-out response goes through the same re-read classification instead of leaking.

Added an integration test for the Unknown path (verified at local)

@il9ue

il9ue commented Jul 3, 2026

Copy link
Copy Markdown
Author

@il9ue il9ue closed this Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants