Resolve problems with paths and compatibility problems with Spark in Azure (v2) by divanik · Pull Request #100420 · ClickHouse/ClickHouse · GitHub
Skip to content

Resolve problems with paths and compatibility problems with Spark in Azure (v2)#100420

Merged
divanik merged 3 commits into
masterfrom
divanik/rerevert_spark_azure_fixes
Mar 24, 2026
Merged

Resolve problems with paths and compatibility problems with Spark in Azure (v2)#100420
divanik merged 3 commits into
masterfrom
divanik/rerevert_spark_azure_fixes

Conversation

@divanik

@divanik divanik commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

This PR addresses several issues: fixes inconsistent path handling in Iceberg caused by mixed usage of storage paths and metadata paths; enforces that Iceberg tables write down a table location which is either a URL or an absolute path; adds a fallback for counting file sizes in Azure because some ClickHouse readers don't support byte counting after traversal; version-hint.txt is now handled in a manner compatible with Spark; introduces type-level abstractions that make it harder to mix up path types in the future; adds tests for Azure and Local that verify cross-engine interoperability without intermediate uploading/downloading; fixes usage of position deletes, which previously relied on path inference heuristics where that approach is inappropriate

Version info

  • Merged into: 26.4.1.203

…ik/add_proper_spark_conf_for_azure_and_local"

This reverts commit eef623a, reversing
changes made to 8f93f9a.
@clickhouse-gh

clickhouse-gh Bot commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label Mar 23, 2026
Comment thread src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergPath.cpp
Comment thread src/Storages/ObjectStorage/DataLakes/Iceberg/Utils.cpp
@divanik divanik changed the title Revert "Merge pull request #100208 from ClickHouse/revert-99163-divan… Resolve problems with paths and compatibility problems with Spark in Azure (v2) Mar 23, 2026
@clickhouse-gh

clickhouse-gh Bot commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.90% 83.90% +0.00%
Functions 24.60% 24.80% +0.20%
Branches 76.50% 76.50% +0.00%

PR changed lines: PR changed-lines coverage: 44.54% (657/1475, 0 noise lines excluded)
Diff coverage report
Uncovered code

@divanik divanik added this pull request to the merge queue Mar 24, 2026
Merged via the queue into master with commit 8268bbd Mar 24, 2026
292 of 300 checks passed
@divanik divanik deleted the divanik/rerevert_spark_azure_fixes branch March 24, 2026 14:07
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 24, 2026
alexey-milovidov added a commit that referenced this pull request Apr 30, 2026
- Replace TODO authoring notes with user-facing summaries (or drop
  entries that are not user-visible).
- Drop changelog item for `#101634` since the PR only adds a test.
- Drop changelog item for `#100315` since the PR only adds an
  internal `getLastWrittenObjectPath` method on `ObjectStorage`.
- Rewrite the duplicated "This PR addresses several issues..." entry
  for `#99163` / `#100420` as a single coherent sentence.
- Fix the `chdig` link target: the text said `v26.4.3` but the URL
  pointed to the `v26.4.1` tag. Both versions exist; the PR title is
  "Bump chdig to v26.4.3", so align the URL with the version.

PR: #103729
zvonand pushed a commit to Altinity/ClickHouse that referenced this pull request May 19, 2026
…solution in next commit)

---
Original cherry-pick message follows:

Merge pull request ClickHouse#100420 from ClickHouse/divanik/rerevert_spark_azure_fixes

Resolve problems with paths and compatibility problems with Spark in Azure (v2)

# Conflicts:
#	src/Interpreters/IcebergMetadataLog.cpp
#	src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergMetadata.cpp
#	src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergWrites.cpp
#	src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFileIterator.cpp
#	src/Storages/ObjectStorage/DataLakes/Iceberg/MultipleFileWriter.cpp
#	src/Storages/ObjectStorage/DataLakes/Iceberg/MultipleFileWriter.h
#	src/Storages/ObjectStorage/DataLakes/Iceberg/Mutations.cpp
#	src/Storages/ObjectStorage/DataLakes/Iceberg/PersistentTableComponents.h
#	src/Storages/ObjectStorage/DataLakes/Iceberg/Utils.cpp
#	src/Storages/ObjectStorage/DataLakes/Iceberg/Utils.h
zvonand added a commit to Altinity/ClickHouse that referenced this pull request May 19, 2026
zvonand added a commit to Altinity/ClickHouse that referenced this pull request May 20, 2026
Adapted PR ClickHouse#90740 (Read iceberg from various paths) to antalya-26.3
without applying the prerequisite upstream PR ClickHouse#100420 (IcebergPath /
path_resolver refactor). The refactor is dropped; raw `String` paths are
used instead.

Adaptations from PR 90740 to antalya-26.3:

- `IcebergPathFromMetadata` references → plain `String` (no `.serialize()`,
  no `IcebergPathFromMetadata::deserialize` wrapping).
- `IcebergPathResolver & path_resolver` parameters → `const String &
  table_location`. Calls like `path_resolver.resolve(x)` become `x`.
- `SecondaryStorages` infrastructure kept: thread-safe map of secondary
  object storages plus a `resolveObjectStorageForPath` helper that maps
  a metadata path to a (storage, key) pair. The IcebergPath-aware
  overload of `resolveObjectStorageForPath` was removed.
- New protocol version `DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH = 7`
  used in `IcebergObjectSerializableInfo::{serializeForClusterFunctionProtocol,
  deserializeForClusterFunctionProtocol}` to gate the new
  `data_object_file_metadata_path` field and `requires_external_storage`
  check; `_path` for delete files goes through `SchemeAuthorityKey` on
  older protocols.

Dropped (depend on upstream commits not on antalya-26.3):

- `ExpireSnapshotsExecute.{cpp,h}`, `RemoveOrphanFilesExecute.{cpp,h}`,
  `SnapshotFilesTraversal.{cpp,h}` — extracted EXECUTE handlers from
  upstream PR introducing per-command refactor. PR 90740 only threads
  `secondary_storages` into these; the underlying refactor is a
  separate dependency. The antalya-26.3 `Iceberg::expireSnapshots` path
  is kept unchanged in `IcebergMetadata::executeCommand`.
- `executeExpireSnapshots` / `executeRemoveOrphanFiles` dispatch in
  `IcebergMetadata::executeCommand` — depends on the dropped files.

References:

- Upstream PR: ClickHouse#90740
- antalya-26.1 backport (used as a structural reference for the
  no-IcebergPath adaptation): 0520e2e
  ("Allow to read iceberg table data from any location")

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
il9ue pushed a commit to Altinity/ClickHouse that referenced this pull request May 26, 2026
The path resolution refactor in ClickHouse#100420/ClickHouse#100295 changed `FileNamesGenerator`'s
constructor signature (5 args → 4 args, with `is_transactional` mapped to
`use_uuid_in_metadata` and `config_path` removed) and removed
`generateMetadataName()` in favour of `generateMetadataPathWithInfo()`.
The `generateManifestList` signature also changed: first arg is now
`IcebergPathResolver` instead of `FileNamesGenerator`, and manifest entry
sizes is `std::vector<Int64>` instead of a scalar.

Adapt the `TRUNCATE TABLE` code path (from #1655) to the new API:

- Two-branch `FileNamesGenerator` construction (transactional vs. non-transactional,
  reading location from JSON) → single construction using
  `persistent_components.path_resolver.getTableLocation()` with
  `is_transactional` as `use_uuid_in_metadata`.
- `generateMetadataName()` → `generateMetadataPathWithInfo()`, returning
  `GeneratedMetadataFileWithInfo { path, version, compression_method }`.
- Storage paths derived via `path_resolver.resolve(path)` where raw strings
  were previously returned.
- Catalog filename via `path_resolver.resolveForCatalog(path)`, replacing the
  manual `location + metadata_name` concatenation.
- `generateManifestList(filename_generator, ...)` → `generateManifestList(path_resolver, ...)`.

Behavior of truncate is preserved: clears data files by writing a new
snapshot with empty manifests and updates the catalog pointer. The new
`IcebergPathResolver` abstraction handles both transactional (full URI)
and non-transactional (bare path) cases transparently.

Refs: #1785, #1655
zvonand added a commit to Altinity/ClickHouse that referenced this pull request Jun 1, 2026
…100420

Backport of ClickHouse#100420/ClickHouse#100295 - Resolve problems with paths and compatibility problems with Spark in Azure (v2)
il9ue pushed a commit to il9ue/ClickHouse that referenced this pull request Jun 1, 2026
…_spark_azure_fixes

Resolve problems with paths and compatibility problems with Spark in Azure (v2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants