Support specifying or auto-assigning Parquet field_ids for output columns by Onyx2406 · Pull Request #101783 · ClickHouse/ClickHouse · GitHub
Skip to content

Support specifying or auto-assigning Parquet field_ids for output columns#101783

Open
Onyx2406 wants to merge 47 commits into
ClickHouse:masterfrom
Onyx2406:feat/parquet-column-field-id
Open

Support specifying or auto-assigning Parquet field_ids for output columns#101783
Onyx2406 wants to merge 47 commits into
ClickHouse:masterfrom
Onyx2406:feat/parquet-column-field-id

Conversation

@Onyx2406

@Onyx2406 Onyx2406 commented Apr 4, 2026

Copy link
Copy Markdown
Contributor

What this PR does

Two related knobs for writing Parquet files with column field_id metadata — needed for Apache Iceberg compatibility, which identifies columns by field_id rather than by name.

Both knobs can be used independently or together.

1. Explicit per-column overrides

New setting output_format_parquet_column_field_ids : Map(String, Int32).

SET output_format_parquet_column_field_ids = '{"col_a": 10, "col_b": 20, "col_c": 30}';

INSERT INTO FUNCTION file('data.parquet')
SELECT 1::UInt32 AS col_a, 'x'::String AS col_b, 42::Int64 AS col_c;

Each output column gets the id specified in the map. The map must cover every output column (unless auto-assign below is enabled to fill the gaps), and ids must be unique; violations raise BAD_ARGUMENTS with specific messages.

2. Auto-assign (Iceberg writer convention)

New setting output_format_parquet_auto_assign_field_ids : Bool, default false.

When enabled, every output column is assigned a unique sequential field_id starting at 1, matching the convention used by Apache Iceberg writers.

SET output_format_parquet_auto_assign_field_ids = 1;

INSERT INTO FUNCTION file('iceberg.parquet')
SELECT 1 AS a, 'x' AS b, 42 AS c;
-- a -> 1, b -> 2, c -> 3

3. Mixed mode

The two settings compose: override-map entries win for the columns they mention; auto-assign fills the remaining columns with the smallest unused positive ids.

SET output_format_parquet_auto_assign_field_ids = 1,
    output_format_parquet_column_field_ids = '{"b": 1}';

INSERT INTO FUNCTION file('mixed.parquet')
SELECT 1 AS a, 2 AS b, 3 AS c;
-- b -> 1 (override), a -> 2, c -> 3 (auto-assign skipping 1)

Implementation

  • src/Core/FormatFactorySettings.h — declare both settings.
  • src/Formats/FormatSettings.h — carry a std::vector<std::pair<String, Int32>> of overrides and the auto_assign_field_ids bool; no string parsing on the hot path.
  • src/Formats/FormatFactory.cpp — convert the Map setting into the pair vector, with structural validation (tuple shape, string keys, integer values in Int32 range).
  • src/Processors/Formats/Impl/ParquetBlockOutputFormat.cppbuildColumnFieldIds() resolves overrides against the actual output header, auto-assigns unused ids if the flag is on, and rejects unknown columns, duplicate ids, and non-covering overrides.

Test

tests/queries/0_stateless/04321_parquet_column_field_ids.sh writes Parquet files with clickhouse-local, reads the field_id metadata back with pyarrow, and covers:

  • Explicit per-column overrides.
  • Auto-assign only.
  • Mixed override + auto-assign.
  • Writing with neither setting (no field_ids, unchanged behavior).
  • Nested types: auto-assign and dotted-path overrides for Array.element, Tuple subfields, and Map.key/Map.value.
  • Geo columns written as WKB under GeoParquet (the Point/Array/Tuple shape collapses to a single top-level field).
  • Error cases: unknown column, non-covering map (top-level and nested), duplicate id, non-integer value, negative id, and a dotted top-level name colliding with a nested path (in both auto-assign and override modes).

Closes: #58753

Changelog category (leave one):

  • New Feature

Changelog entry:

Added output_format_parquet_column_field_ids (Map(String, Int32)) to set explicit Parquet field_id overrides per column and output_format_parquet_auto_assign_field_ids (Bool) to auto-assign sequential field_ids to every output column, matching the Apache Iceberg writer convention. The two settings can be combined.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Note

Medium Risk
Touches the Parquet write path and schema generation to emit field_id metadata, which could affect interoperability and schema expectations; defaults are preserved unless new settings are enabled.

Overview
Adds two new Parquet output settings to control column field_id metadata for Iceberg compatibility: output_format_parquet_column_field_ids (explicit per-column overrides) and output_format_parquet_auto_assign_field_ids (sequential auto-assignment starting at 1).

FormatFactory now parses/validates the Map setting into typed overrides up-front, while ParquetBlockOutputFormat resolves the final field_id mapping (including error checks for unknown columns, duplicates, negative ids, and incomplete coverage when auto-assign is off) and explicitly rejects using these settings when a datalake writer provides its own column-id mapping.

Documentation and a new stateless test are added to verify correct field_id emission via pyarrow and expected failure modes.

Reviewed by Cursor Bugbot for commit ce5f371. Bugbot is set up for automated code reviews on this repo. Configure here.

Onyx2406 added 3 commits April 4, 2026 18:39
Add `output_format_parquet_column_field_ids` setting to specify Parquet
field IDs per column (e.g., 'col_a: 10, col_b: 20'). This is needed
for Apache Iceberg compatibility.

Fixes ClickHouse#58753
…ring

- Restore accidentally deleted `query_plan_optimize_join_order_randomize`
  entry in the 26.4 block of `SettingsChangesHistory.cpp`
- Rename test 04069 to 04077 to avoid collision with
  `04069_index_type_with_settings`
- Update Parquet file prefixes inside the test SQL from `04076_` to
  `04077_` to match the new test number
- Change "non-negative integers" to "integers" in the
  `output_format_parquet_column_field_ids` description since Parquet
  uses signed int32 for field_id
Comment thread src/Core/FormatFactorySettings.h Outdated
@alexey-milovidov alexey-milovidov self-assigned this Apr 5, 2026

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@alexey-milovidov

Copy link
Copy Markdown
Member

NOTE: the code for this change is folded into the preceding merge commit
`d66467fbf65` because it touched conflict-resolution lines in
`FormatSettings.h`. This empty commit is a marker that describes the
intent; the actual diff lives in that merge commit.

Rework per alexey's review (ClickHouse#101783):

1. `output_format_parquet_column_field_ids` is now `Map(String, Int32)`
   instead of a `String` mini-language (`'a: 1, b: 2'`). Users configure it
   as structured data, matching how `http_response_headers` and
   `additional_table_filters` are defined.
2. New `output_format_parquet_auto_assign_field_ids` (`Bool`, default
   `false`). When on, every output column gets a sequential Parquet
   `field_id` starting at 1 (Iceberg writer convention). Any column
   present in `output_format_parquet_column_field_ids` keeps its
   overridden id; auto-assignment fills the remaining columns and skips
   ids already taken by overrides.
3. The parquet output format validates overrides against the actual
   output header: unknown columns, duplicate ids, and non-covering
   overrides (when auto-assign is off) all error with BAD_ARGUMENTS.
4. Test (`04080_parquet_column_field_ids.sql`) rewritten for the new
   JSON map syntax and extended to cover auto-assign, mixed
   override+auto-assign, and the new error paths.
@Onyx2406 Onyx2406 changed the title Support specifying field_id for columns when writing Parquet files Support specifying or auto-assigning Parquet field_ids for output columns Apr 18, 2026
alexey-milovidov and others added 5 commits May 12, 2026 02:07
…-field-id

# Conflicts:
#	src/Formats/FormatFactory.cpp
#	src/Formats/FormatSettings.h
#	src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp
The PR was originally written against `26.4`, but master has since
opened a `26.5` block. New settings introduced by an unmerged PR
belong to the current development version, so move
`output_format_parquet_column_field_ids` and
`output_format_parquet_auto_assign_field_ids` accordingly. Also wrap
literal names in inline code in the descriptions for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old Arrow-based Parquet writer was removed in master (commit
80f3bb8, `Remove old Parquet writer and reader implementations`),
so `output_format_parquet_use_custom_encoder` is now obsolete and the
two writer paths the test exercised have merged into one. Drop the
duplicated arrow-encoder test case and tighten the comment; the
remaining cases still cover overrides, auto-assign, mixed mode, the
no-setting default, and the three error conditions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`output_format_parquet_column_field_ids` is declared as a `Map` and
included by every storage engine that pulls in
`LIST_OF_ALL_FORMAT_SETTINGS`. Each engine forwards its setting types
through its own `_SETTINGS_SUPPORTED_TYPES` list, so without adding
`Map` to that list the `KafkaSettings##TYPE` (and equivalent)
helper-name expansion fails to compile with `unknown type name
'KafkaSettingsMap'`. Add `Map` to all nine such engines
(`Kafka`, `Hive`, `Set`, `RabbitMQ`, `FileLog`, `StorageObjectStorage`,
`ObjectStorageQueue`, `NATS`, `DataLakeStorage`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… setting

`SettingFieldMap` always stores `Map(String, String)`, so the value
field always arrives as a `String` and never as `Int64`/`UInt64`. The
original parser dispatched on `Int64`/`UInt64` and rejected everything
else, which made every well-formed `SET output_format_parquet_column_field_ids = ...`
throw `BAD_ARGUMENTS: values must be integers`.

Parse the string value with `tryParse<Int64>` instead, and keep the
existing range check that rejects negatives and values above
`Int32::max`. Tighten the test to use the canonical `Map(String, String)`
literal syntax (`{'a': '10', ...}`) instead of JSON inside a string,
add a case covering a non-integer value, and update the setting's
docstring example to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 12, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [6322069]

Summary:

job_name test_name status info comment
Integration tests (amd_asan_ubsan, flaky) FAIL
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1] FAIL cidb
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2] FAIL cidb
Integration tests (amd_asan_ubsan, db disk, old analyzer, 4/6) FAIL
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1] FAIL cidb
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2] FAIL cidb
Integration tests (arm_binary, distributed plan, 3/4) FAIL
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1] FAIL cidb
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2] FAIL cidb
Integration tests (amd_tsan, 4/6) FAIL
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1] FAIL cidb
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2] FAIL cidb
Integration tests (amd_msan, 2/8) FAIL
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1] FAIL cidb
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2] FAIL cidb
Integration tests (amd_llvm_coverage, 7/8) FAIL
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1] FAIL cidb
test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2] FAIL cidb
Stress test (arm_tsan) FAIL
Logical error: Not-ready Set is passed as the second argument for function 'A (STID: 0250-41a5) FAIL cidb, issue

AI Review

Summary

This PR adds Parquet field_id controls for Iceberg compatibility: explicit per-path overrides, sequential auto-assignment, nested-field handling, docs, and tests. The core writer changes look consistent with the current schema-building logic, but one claimed regression test still does not exercise the datalake guard it is supposed to cover, so the review remains blocked on that gap.

Findings

⚠️ Majors

  • [tests/integration/test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py:50-61] The new Iceberg regression test still does not reach ParquetBlockOutputFormat.cpp:195. Engine-backed Iceberg writes freeze FormatSettings at CREATE TABLE time and ignore per-INSERT session/user format settings, so these two inserts bypass the guard entirely instead of proving that datalake metadata remains authoritative for field_ids. Rewrite the test to put the setting into CREATE TABLE ... SETTINGS ... / additional_settings, or route the write through INSERT INTO FUNCTION icebergLocal(...), where query-time format settings are actually used.
Final Verdict

Request changes. The writer logic is close, but the remaining regression test for the datalake guard is still testing the wrong surface and does not prove the invariant it claims to cover.

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label May 12, 2026
`getFormatSettings` runs on every query, including SELECTs that don't
write Parquet, so parsing the `output_format_parquet_column_field_ids`
value to `Int32` in `FormatFactory` made `SETTINGS
output_format_parquet_column_field_ids = {'a': 'oops'}` fail
client-side with `Error on processing query: ...`. Tests using `{
serverError BAD_ARGUMENTS }` then saw a client error instead of a
server error and reported "Expected server error ... but got no server
error".

Keep the raw `Map(String, String)` values in `FormatSettings` and do
the integer parse + range check inside `ParquetBlockOutputFormat`'s
`buildColumnFieldIds`, alongside the existing unknown-column /
duplicate-id checks. The same error now reaches the test framework as
a normal server error.

Also fix the test so it's re-runnable under the flaky check: switch
the success cases from `INSERT INTO FUNCTION file('<fixed-name>')` to
`currentDatabase() || ...` and set `engine_file_truncate_on_insert =
1`, so a second run doesn't trip `CANNOT_APPEND_TO_FILE`. Convert the
four error cases from `INTO OUTFILE` to `INSERT INTO FUNCTION
file(...)` so the validation errors arrive as server errors that `{
serverError BAD_ARGUMENTS }` recognises.

See CI report:
ClickHouse#101783

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp Outdated
Comment thread tests/queries/0_stateless/04080_parquet_column_field_ids.sql Outdated
alexey-milovidov and others added 7 commits May 18, 2026 21:30
…ataLake settings

The PR adds an `output_format_parquet_column_field_ids` setting of type
`Map(String, Int32)`. Master subsequently moved to a constexpr settings
layout that requires every type used by any setting in the merged list to
appear in the supported-types macro for that settings class.

Without this fix the build fails in `DatabaseDataLakeSettings.cpp` with a
VLA diagnostic, because the layout computation hits an undeclared
`SettingTypeTag` value for `Map`.

Fast-test build report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/101783/463edd23905aa45f867156b09f6e7e3eadc02666/fast_test/build_clickhouse/build_clickhouse.log

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…actory time

Addresses review feedback on PR ClickHouse#101783:

* ClickHouse#101783 (comment) ("Don't use mini-languages inside strings. Instead, use structured data like Map.")
* ClickHouse#101783 (comment) (datalake column-id mapping precedence)

`FormatSettings::Parquet::column_field_ids` now stores `Int32` directly
instead of `String`, and `FormatFactory` parses the `Map` setting at SET
time. Integer values can be passed natively (`{'a': 1, 'b': 2}`) instead of
encoded as strings; quoted-integer strings are still accepted for
compatibility.

When a datalake (e.g. Iceberg) writer hands `ParquetBlockOutputFormat` a
column-id mapping, that mapping is now authoritative; combining it with
`output_format_parquet_column_field_ids` or
`output_format_parquet_auto_assign_field_ids` raises `BAD_ARGUMENTS`
instead of silently producing a Parquet file whose `field_id`s no longer
match table metadata.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses ClickHouse#101783 (comment):
the previous SQL test only round-tripped values, so a regression that dropped
`field_id` metadata from the Parquet schema would not have been caught.

The test is now a shell script that uses pyarrow to read the written Parquet
schema and print `field_id`s for each column. Covers explicit overrides,
auto-assign, mixed override + auto-assign, the default (no `field_id`) path,
string-encoded ids for compatibility, and the error cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickHouse's settings parser only accepts string literals as `Map` values
(integer literals fail with `SYNTAX_ERROR`), so the user-visible setting is
`output_format_parquet_column_field_ids = {'col_a': '1', ...}`. Update the
test to use that syntax everywhere and the docs / setting description to
match.

The local C++ side still accepts both `Int64`/`UInt64` (for programmatic
callers) and `String` (for SQL callers), so no implementation change is
needed.

Verified end-to-end with the local build — all five round-trip cases and
all five error cases match the reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The push to the fork is blocked by the missing `workflow` OAuth scope. Restore
`.github/workflows/pull_request.yml` to the pre-merge state so the branch can
be pushed; the next maintainer-driven sync to master will pull the latest
workflow definition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh clickhouse-gh Bot added the manual approve Manual approve required to run CI label May 18, 2026
…-field-id

# Conflicts:
#	.github/workflows/pull_request.yml
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged the latest master (the branch was ~306 commits behind, 0 conflicts — the only file conflict, Mutations.cpp arm_tidy init-vars, was already fixed identically on master via #106102).

Renumbered the test 04309_parquet_column_field_ids04321_parquet_column_field_ids: the master merge brought in 04309_explain_insert_format, which collided with the 04309_ prefix (04321 is the next free number). Updated the internal temporary-file names accordingly.

Rebuilt locally (clean) and ran 04321_parquet_column_field_ids against the freshly-built binary — exact reference match (explicit overrides, auto-assign, mixed mode, default, nested Array/Map/Tuple auto-assign, dotted nested overrides, GeoParquet WKB collapse, and all the BAD_ARGUMENTS error cases). All review threads are resolved.

The CI reds are all unrelated to this Parquet-output-only change: the Performance Comparison failures (rand, logical_functions_small, async_remote_read) are flagged on both ::old and ::new, i.e. measurement noise, not a regression.

@alexey-milovidov

Copy link
Copy Markdown
Member

Merged latest master to refresh the stale base (the CI red was entirely 0s aborted jobs from the outdated merge base, not real failures). Clean merge, no conflicts; test 04321_parquet_column_field_ids does not collide with anything new on master.

Rebuilt clickhouse (clean) and re-ran the test against a fresh server — passes:

[1 / 1] 04321_parquet_column_field_ids: [ OK ] 4.79 sec.

All 6 review threads are resolved. CI should now run on a current base.

Comment thread src/Core/SettingsChangesHistory.cpp Outdated
{"enable_streaming_queries", false, false, "New setting"},
{"optimize_prewhere_after_pushdown", false, false, "New setting that enables a second PREWHERE promotion pass to merge filters deposited above a MergeTree read step by later optimizations (predicate pushdown through JOIN, projection rewrites) into the existing PREWHERE chain."},
{"output_format_parquet_column_field_ids", "", "", "New setting to specify explicit Parquet `field_id` overrides per column (`Map(String, Int32)`), useful for Iceberg compatibility."},
{"output_format_parquet_auto_assign_field_ids", false, false, "New setting to auto-assign sequential Parquet `field_id`s to every output column, Iceberg-style."},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not true by default?

Also a question - if we don't auto-assign field ids, what is wrong with that? - why field ids are even needed?

Comment thread src/Core/SettingsChangesHistory.cpp Outdated
{"show_remote_databases_in_system_tables", false, false, "Renamed from `show_data_lake_catalogs_in_system_tables` and broadened to also hide `MySQL` and `PostgreSQL` databases from `system.tables`, `system.columns` and `system.completions` by default, since enumerating their tables requires expensive remote calls. Users who relied on the previous behavior must set this setting to `true`. The old name is kept as an alias."},
{"enable_streaming_queries", false, false, "New setting"},
{"optimize_prewhere_after_pushdown", false, false, "New setting that enables a second PREWHERE promotion pass to merge filters deposited above a MergeTree read step by later optimizations (predicate pushdown through JOIN, projection rewrites) into the existing PREWHERE chain."},
{"output_format_parquet_column_field_ids", "", "", "New setting to specify explicit Parquet `field_id` overrides per column (`Map(String, Int32)`), useful for Iceberg compatibility."},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the scenario when a user needs to assign field ids manually? It sounds cumbersome.


if (const auto * array_type = typeid_cast<const DataTypeArray *>(inner.get()))
{
enumerateFieldPaths(name + ".element", array_type->getNestedType(), out);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this. Here we introduced a new notation for almost no reason. If there is no better way, I will close this PR and withdraw the task.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we are okay with always assign field ids automatically, we can remove this code.

else if (const auto * tuple_type = typeid_cast<const DataTypeTuple *>(inner.get()))
{
for (size_t i = 0; i < tuple_type->getElements().size(); ++i)
enumerateFieldPaths(name + "." + tuple_type->getNameByPosition(i + 1), tuple_type->getElement(i), out);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unsafe.

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the manual assign feature, all ok.

@alexey-milovidov

Copy link
Copy Markdown
Member

Ok. I understand why manual assignment is needed.
But I don't like ad-hoc invented substream names.

@alexey-milovidov

Copy link
Copy Markdown
Member

If we make substream names exactly as they are in ClickHouse (so we don't invent anything for this task), it will be good.

@alexey-milovidov

Copy link
Copy Markdown
Member

Apparently, the substream names are this way in Parquet :(

namespace
{

/// Whether `prepareGeoColumn` will collapse this top-level column into a single WKB `String`

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the code complex?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me look

alexey-milovidov and others added 2 commits June 18, 2026 09:40
`convertSchema` validated the user-supplied Iceberg `field_id` map against the
original ClickHouse column type, but for a recognized geo custom type (`Point`,
`LineString`, `Polygon`, `MultiLineString`, `MultiPolygon`, `Geometry`)
`prepareGeoColumn` later collapses the column into a single WKB `String` field
when `output_format_parquet_geometadata` is enabled (the default). So
`validateIcebergFieldIds` walked the pre-WKB `Tuple` / `Array` shape and demanded
nested `field_id`s (e.g. `point.x`, `point.y`) that the writer never emits and
that `buildColumnFieldIds` never produces — wrongly rejecting a valid top-level
override and breaking auto-assign for geo columns.

For example, `SELECT (1.0, 2.0)::Point AS point SETTINGS
output_format_parquet_column_field_ids = {'point': '7'}` (or auto-assign) failed
with `INCORRECT_DATA` even though the emitted Parquet schema has only the
top-level `point` WKB field.

Fix: share the geo-scalar rule between the `field_id` builder and the validator.
`isGeoColumnWrittenAsWKBScalar` moves from an anonymous namespace in
`ParquetBlockOutputFormat.cpp` to `DB::Parquet` (declared in `Write.h`), and
`convertSchema` now validates a geo-WKB-scalar column as a single top-level field
instead of recursing into its nested shape — matching the schema actually
written and the map `buildColumnFieldIds` builds.

CI report (before fix): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=8ef07d4eb682a8ca2c29f3d81fc9d2ba557df4db

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/// When a datalake (e.g. Iceberg) writer hands us a column-id mapping, that mapping is
/// the source of truth — letting session settings override it would produce Parquet files
/// whose `field_id`s no longer match the table metadata, breaking subsequent reads.
if (has_metadata_mapping && user_set_field_ids)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This guard protects the important invariant that datalake metadata remains the source of truth for field_ids, but I don't see regression coverage for it. The new 04321_parquet_column_field_ids test only writes direct Parquet files, so removing this check would still leave the current test green while reintroducing the path where session settings can override an Iceberg table's column-id mapping.

Please add a focused datalake/Iceberg write case that has format_filter_info_->column_mapper present and verifies that either output_format_parquet_column_field_ids or output_format_parquet_auto_assign_field_ids fails with BAD_ARGUMENTS.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guard at ParquetBlockOutputFormat.cpp:195 fires only when format_filter_info_->column_mapper is set, which happens exclusively when writing through a datalake (Iceberg) storage that provides its own column-id mapping. The stateless test writes via file() / clickhouse-local, which never has a column_mapper, so the guard is not reachable there. Covering it needs the Iceberg integration-test harness (write to a real datalake table with output_format_parquet_column_field_ids / output_format_parquet_auto_assign_field_ids set and assert BAD_ARGUMENTS); will add it as a follow-up integration test rather than expanding the stateless test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still doesn't cover the guard at ParquetBlockOutputFormat.cpp:195.

tests/integration/test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py:50-61 sets output_format_parquet_auto_assign_field_ids / output_format_parquet_column_field_ids on the INSERT, but StorageObjectStorage freezes FormatSettings for engine-backed Iceberg tables at CREATE TABLE time (src/Storages/ObjectStorage/registerStorageObjectStorage.cpp:54-69) and then reuses those stored settings on write (src/Storages/ObjectStorage/StorageObjectStorage.cpp:630-631). Because the per-INSERT settings never reach MultipleFileWriter/ParquetBlockOutputFormat, both inserts bypass this guard and the test still doesn't prove the invariant.

To make this a real regression test, either put the setting on the table definition (CREATE TABLE ... SETTINGS ..., via additional_settings in create_iceberg_table) or write through the icebergLocal(...) table function, where query-time format settings are resolved on the write path.

alexey-milovidov and others added 3 commits June 26, 2026 00:52
…sHistory

After merging `master`, the `02995_new_settings_history` baseline was bumped
to 26.6, so settings recorded in the `26.6` block of `SettingsChangesHistory.cpp`
are now treated as part of the released baseline rather than as new in the
current development cycle. As a result, the check started reporting:

    PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: output_format_parquet_column_field_ids WAS ADDED
    PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: output_format_parquet_auto_assign_field_ids WAS ADDED

Move `output_format_parquet_column_field_ids` and
`output_format_parquet_auto_assign_field_ids` from the `26.6` block to the
`26.7` block so they are recognized as newly added settings, fixing the
`02995_new_settings_history` check.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=621a9fdc7750a62bd2383c8b9c8e319ccbb31e28&name_0=PR&name_1=Fast%20test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp
Comment thread docs/en/interfaces/formats/Parquet/Parquet.md Outdated
alexey-milovidov and others added 2 commits June 30, 2026 05:40
…d_ids

Address review feedback: the `output_format_parquet_column_field_ids` docs
row described explicit overrides but did not state that, when
`output_format_parquet_auto_assign_field_ids` is disabled, a non-empty map
must cover every output column and every nested field. As written, a user
could try a partial map like `{'a': '1'}` and only discover at runtime that
`buildColumnFieldIds` raises `BAD_ARGUMENTS`. Spell out the full-coverage
requirement and point to `output_format_parquet_auto_assign_field_ids` as the
way to fill the remaining paths automatically.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Pushed 9cd4798c24c..d582ec045d9:

  • Merged latest master (the branch was ~994 commits behind and CONFLICTING). The merge was clean — no conflicts. The test 04321_parquet_column_field_ids does not collide with anything new on master (the 04321 numeric prefix is shared by several unrelated tests, but the full filename is unique), and both new settings remain correctly registered under the current dev 26.7 block in SettingsChangesHistory.cpp. Re-ran -fsyntax-only on ParquetBlockOutputFormat.cpp, PrepareForWrite.cpp, and FormatFactory.cpp against the merged tree — all clean.
  • Documented the explicit-map full-coverage rule (d582ec045d9): the output_format_parquet_column_field_ids docs row now states that, with output_format_parquet_auto_assign_field_ids disabled, a non-empty map must cover every output column and every nested field (otherwise BAD_ARGUMENTS), and that auto-assign fills the remaining paths. Resolves the open docs thread.
  • Updated the stale PR description: it referenced the old 04080_parquet_column_field_ids.sql and a non-existent "Arrow encoder path"; it now points to 04321_parquet_column_field_ids.sh with the actual coverage, and uses Closes: per convention.

The two remaining AI-review findings on the datalake path are addressed in their threads:

  • The dotted-path collision concern on the datalake metadata mapping (getStorageColumnEncoding()convertSchema) is pre-existing behavior on master — this PR does not touch that path; it only adds the user-setting path with its own collision check.
  • The missing regression coverage for the datalake guard is not reachable from a stateless test (the column_mapper is only present for actual Iceberg/datalake writes); flagged for a follow-up integration test.

The previous CI reds are unrelated, known master-wide flakes (not touched by this Parquet-output change):

  • Stress test (arm_ubsan): Logical error: 'Unexpected exception in refresh scheduling' in RefreshTask::doScheduling — the chronic RefreshTask flake (#106737 family), 328 failures across 217 PRs (37 on master) in the last 30 days.
  • Upgrade check (amd_release): PostgreSQLConnectionPool connection to 192.0.2.1 (a non-routable TEST-NET-1 address) timing out in DatabasePostgreSQL::removeOutdatedTables() — the recurring upgrade-check PostgreSQL/MySQL connection flake.

CI re-runs on the fresh base after this push.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=latest&name_0=PR

`ParquetBlockOutputFormat` rejects `output_format_parquet_column_field_ids` /
`output_format_parquet_auto_assign_field_ids` when a datalake writer (Iceberg)
supplies its own column-id mapping via `format_filter_info->column_mapper`,
because that mapping is the authoritative source for the Parquet `field_id`s and
letting the session settings override it would emit `field_id`s that no longer
match the table metadata.

That guard is not reachable from the stateless `04321_parquet_column_field_ids`
test, which writes via `file` / `clickhouse-local` and therefore never has a
`column_mapper`. This adds a focused integration test in
`test_storage_iceberg_no_spark` that writes to an Iceberg table with each setting
enabled and asserts the write fails with `BAD_ARGUMENTS`, so removing the guard
would no longer leave CI green.

Addresses the AI Review "Request changes" verdict and the review thread on
`ParquetBlockOutputFormat.cpp` asking for coverage of this guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Pushed d582ec045d9..f44537142bb.

Added the datalake-guard regression test that the AI Review "Request changes" verdict (and the ParquetBlockOutputFormat.cpp:195 thread) asked for — implementing the follow-up integration test mentioned in that thread.

New test tests/integration/test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py:

  • Writes to an Iceberg table — the only path that hands ParquetBlockOutputFormat a non-null format_filter_info->column_mapper (via MultipleFileWriter::startNewFile, which builds the mapper from the Iceberg schema).
  • Asserts that an INSERT with output_format_parquet_auto_assign_field_ids = 1, and one with output_format_parquet_column_field_ids = {'x': '1'}, each fail with BAD_ARGUMENTS (... cannot be used when writing to a datalake table that provides its own column-id mapping).
  • Confirms a plain INSERT still succeeds and that the rejected inserts commit no rows.

This closes the gap the stateless 04321_parquet_column_field_ids test cannot reach (it writes via file / clickhouse-local, which never has a column_mapper), so dropping the guard would now turn CI red instead of leaving it green. I used the no_spark suite because allow_insert_into_iceberg is enabled cluster-wide there and the table is created and written entirely by ClickHouse (no Spark dependency), matching test_writes_multiple_files.py.

CI: the only red on the prior commit was Unit tests (msan, function_prop_fuzzer)FunctionsStress.stress on SELECT reinterpret(CAST(-1012.7 AS Decimal(9, 1)), CAST('Decimal32(4)' AS String)) — the random-function property fuzzer, tracked by #107242, unrelated to this Parquet-output-only change.

The remaining open threads are maintainer design calls (the manual-assign rationale; the substream-name notation, which matches Parquet's own; and the pre-existing dotted-path behavior on the datalake metadata path) — left for @alexey-milovidov.

@alexey-milovidov

Copy link
Copy Markdown
Member

Heads-up on the two CI reds on f4453714 — neither is a Parquet-encoding regression.

1. The new datalake-guard test test_writes_reject_field_id_settings is red, and it stays red with the guard in place — so the assumption in my previous comment ("dropping the guard would now turn CI red instead of leaving it green") is actually inverted. The guard at ParquetBlockOutputFormat.cpp:195 never runs here, because the per-INSERT field-id settings never reach the Iceberg write path:

  • create_iceberg_table("local", …) produces a CREATE TABLE … ENGINE=IcebergLocal(…). For a storage engine, format_settings is frozen at table-creation time from the CREATE … SETTINGS clause plus global defaults — "Use format settings from global server context + settings from the SETTINGS clause of the create query. Settings from current session and user are ignored." (registerStorageObjectStorage.cpp:54-70).
  • The write path passes that frozen member straight through (StorageObjectStorage::write, StorageObjectStorage.cpp:612); unlike the read path it does not re-derive from the query context, and getOutputFormatParallelIfPossible then uses it verbatim because it is non-nullopt.
  • So output_format_parquet_auto_assign_field_ids / output_format_parquet_column_field_ids set on the INSERT never land in format_settings.parquet. user_set_field_ids is false, the guard is skipped, the INSERT succeeds, and query_and_get_error raises "Client expected to be failed but succeeded!".

The practical consequence is larger than the test: these settings can only reach an Iceberg write through the table's CREATE … SETTINGS clause (or a profile/global default), never per-query. So the "session setting silently overrides the datalake field_id mapping" scenario the guard defends against cannot happen via session settings at all — and if the settings are baked into the table, the guard would fire on every write to it.

The fix direction depends on the design call in the open threads:

  • keep the guard → the test should set the field-id settings in CREATE TABLE … SETTINGS (a separate table from the plain-insert one), where they reach the format and the write is rejected;
  • drop the manual-assign path / always auto-assign → the guard and this test go away with it.

I left it rather than rewrite the test, since it's tied to that decision rather than a mechanical fix.

2. Unit tests (asan_ubsan, function_prop_fuzzer)FunctionsStress.stress is the unrelated master-wide property-fuzzer failure tracked by #107242 (here it surfaced a reinterpret(CAST(… AS Decimal(9, 7)), 'Decimal32(5)') scale mismatch — function returned column of unexpected type or scale: Const(Decimal32) instead of Decimal(9, 5)). It fails on master (pull_request_number = 0) and ~25 other PRs the same day; nothing to do with Parquet output.

@alexey-milovidov

Copy link
Copy Markdown
Member

Merged master into the branch (f44537142bb..63220697583) — clean auto-merge, the net diff is unchanged (still the same 23 files, +681/-5; the two SettingsChangesHistory.cpp / FormatFactorySettings.h entries survived). This clears the stale CONFLICTING state.

The only PR-caused red is the new test_writes_reject_field_id_settings.py (it fails deterministically across every integration shard — flaky, amd_tsan, amd_msan, arm_binary, …). The other reds are unrelated: FunctionsStress.stress (#107242), Hung check (#107941), arm_release Unexpected return type decorrelation flake, and the AST fuzzer one is a heap-buffer-overflow in AggregateFunctionVarianceSimple::add — an aggregate-function fuzzer find, nothing to do with Parquet output.

Root cause of the test failure — the guard is unreachable on the path the test uses. The test creates a table engine (create_iceberg_tableENGINE = IcebergLocal(...), no SETTINGS clause) and then does INSERT ... SETTINGS output_format_parquet_auto_assign_field_ids = 1. But for object-storage engines the FormatSettings are frozen at CREATE TABLE time — registerStorageObjectStorage.cpp is explicit: "Use format settings from global server context + settings from the SETTINGS clause of the create query. Settings from current session and user are ignored." So the per-INSERT setting never reaches the writer: in the ParquetBlockOutputFormat ctor user_set_field_ids is false, the has_metadata_mapping && user_set_field_ids guard doesn't fire, the INSERT succeeds, and query_and_get_error reports "Client expected to be failed but succeeded!".

A corollary that's directly relevant to the still-open manual-assign discussion: on the natural SET …; INSERT INTO <iceberg table> path these settings are silently (and safely) ignored — the Iceberg column_mapper is always authoritative there — so there is nothing to corrupt on that path, and dropping the guard would not turn this test red (it's red today with the guard present).

Two ways to make the test actually exercise the guard, if we keep it:

  • Freeze the setting into the engine: CREATE TABLE t (...) ENGINE = IcebergLocal(...) SETTINGS output_format_parquet_auto_assign_field_ids = 1 (then any INSERT trips it — needs a separate table from the plain-insert case). The DataLake settings collection embeds format settings (that's the M(CLASS_NAME, Map) line this PR adds to DataLakeStorageSettings.h).
  • Or write via the table function INSERT INTO FUNCTION icebergLocal(...) SETTINGS output_format_parquet_auto_assign_field_ids = 1 SELECT ... — table functions pass format_settings = std::nullopt, so they resolve from the query context at write time and the guard does fire.

I left the fix to you: whether to keep the guard at all is entangled with the manual-assign notation thread you're still deciding, and I couldn't run the integration suite in this environment to confirm which of the two rewrites is cleanest for the no_spark suite.

@clickhouse-gh

clickhouse-gh Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.50% +0.00%
Functions 92.70% 92.60% -0.10%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 144/176 (81.82%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support specifying field_id for columns when writing to Parquet files

3 participants