Support specifying or auto-assigning Parquet field_ids for output columns#101783
Support specifying or auto-assigning Parquet field_ids for output columns#101783Onyx2406 wants to merge 47 commits into
Conversation
Add `output_format_parquet_column_field_ids` setting to specify Parquet field IDs per column (e.g., 'col_a: 10, col_b: 20'). This is needed for Apache Iceberg compatibility. Fixes ClickHouse#58753
…ring - Restore accidentally deleted `query_plan_optimize_join_order_randomize` entry in the 26.4 block of `SettingsChangesHistory.cpp` - Rename test 04069 to 04077 to avoid collision with `04069_index_type_with_settings` - Update Parquet file prefixes inside the test SQL from `04076_` to `04077_` to match the new test number - Change "non-negative integers" to "integers" in the `output_format_parquet_column_field_ids` description since Parquet uses signed int32 for field_id
…er collision with upstream master
…lickHouse#101994, ClickHouse#102158, ClickHouse#101952, ClickHouse#102148, ClickHouse#102008, ClickHouse#102010) # Conflicts: # src/Formats/FormatSettings.h
NOTE: the code for this change is folded into the preceding merge commit `d66467fbf65` because it touched conflict-resolution lines in `FormatSettings.h`. This empty commit is a marker that describes the intent; the actual diff lives in that merge commit. Rework per alexey's review (ClickHouse#101783): 1. `output_format_parquet_column_field_ids` is now `Map(String, Int32)` instead of a `String` mini-language (`'a: 1, b: 2'`). Users configure it as structured data, matching how `http_response_headers` and `additional_table_filters` are defined. 2. New `output_format_parquet_auto_assign_field_ids` (`Bool`, default `false`). When on, every output column gets a sequential Parquet `field_id` starting at 1 (Iceberg writer convention). Any column present in `output_format_parquet_column_field_ids` keeps its overridden id; auto-assignment fills the remaining columns and skips ids already taken by overrides. 3. The parquet output format validates overrides against the actual output header: unknown columns, duplicate ids, and non-covering overrides (when auto-assign is off) all error with BAD_ARGUMENTS. 4. Test (`04080_parquet_column_field_ids.sql`) rewritten for the new JSON map syntax and extended to cover auto-assign, mixed override+auto-assign, and the new error paths.
…-field-id # Conflicts: # src/Formats/FormatFactory.cpp # src/Formats/FormatSettings.h # src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp
The PR was originally written against `26.4`, but master has since opened a `26.5` block. New settings introduced by an unmerged PR belong to the current development version, so move `output_format_parquet_column_field_ids` and `output_format_parquet_auto_assign_field_ids` accordingly. Also wrap literal names in inline code in the descriptions for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old Arrow-based Parquet writer was removed in master (commit 80f3bb8, `Remove old Parquet writer and reader implementations`), so `output_format_parquet_use_custom_encoder` is now obsolete and the two writer paths the test exercised have merged into one. Drop the duplicated arrow-encoder test case and tighten the comment; the remaining cases still cover overrides, auto-assign, mixed mode, the no-setting default, and the three error conditions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`output_format_parquet_column_field_ids` is declared as a `Map` and included by every storage engine that pulls in `LIST_OF_ALL_FORMAT_SETTINGS`. Each engine forwards its setting types through its own `_SETTINGS_SUPPORTED_TYPES` list, so without adding `Map` to that list the `KafkaSettings##TYPE` (and equivalent) helper-name expansion fails to compile with `unknown type name 'KafkaSettingsMap'`. Add `Map` to all nine such engines (`Kafka`, `Hive`, `Set`, `RabbitMQ`, `FileLog`, `StorageObjectStorage`, `ObjectStorageQueue`, `NATS`, `DataLakeStorage`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… setting
`SettingFieldMap` always stores `Map(String, String)`, so the value
field always arrives as a `String` and never as `Int64`/`UInt64`. The
original parser dispatched on `Int64`/`UInt64` and rejected everything
else, which made every well-formed `SET output_format_parquet_column_field_ids = ...`
throw `BAD_ARGUMENTS: values must be integers`.
Parse the string value with `tryParse<Int64>` instead, and keep the
existing range check that rejects negatives and values above
`Int32::max`. Tighten the test to use the canonical `Map(String, String)`
literal syntax (`{'a': '10', ...}`) instead of JSON inside a string,
add a case covering a non-integer value, and update the setting's
docstring example to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [6322069] Summary: ❌
AI ReviewSummaryThis PR adds Parquet Findings
Final VerdictRequest changes. The writer logic is close, but the remaining regression test for the datalake guard is still testing the wrong surface and does not prove the invariant it claims to cover. |
`getFormatSettings` runs on every query, including SELECTs that don't
write Parquet, so parsing the `output_format_parquet_column_field_ids`
value to `Int32` in `FormatFactory` made `SETTINGS
output_format_parquet_column_field_ids = {'a': 'oops'}` fail
client-side with `Error on processing query: ...`. Tests using `{
serverError BAD_ARGUMENTS }` then saw a client error instead of a
server error and reported "Expected server error ... but got no server
error".
Keep the raw `Map(String, String)` values in `FormatSettings` and do
the integer parse + range check inside `ParquetBlockOutputFormat`'s
`buildColumnFieldIds`, alongside the existing unknown-column /
duplicate-id checks. The same error now reaches the test framework as
a normal server error.
Also fix the test so it's re-runnable under the flaky check: switch
the success cases from `INSERT INTO FUNCTION file('<fixed-name>')` to
`currentDatabase() || ...` and set `engine_file_truncate_on_insert =
1`, so a second run doesn't trip `CANNOT_APPEND_TO_FILE`. Convert the
four error cases from `INTO OUTFILE` to `INSERT INTO FUNCTION
file(...)` so the validation errors arrive as server errors that `{
serverError BAD_ARGUMENTS }` recognises.
See CI report:
ClickHouse#101783
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ataLake settings The PR adds an `output_format_parquet_column_field_ids` setting of type `Map(String, Int32)`. Master subsequently moved to a constexpr settings layout that requires every type used by any setting in the merged list to appear in the supported-types macro for that settings class. Without this fix the build fails in `DatabaseDataLakeSettings.cpp` with a VLA diagnostic, because the layout computation hits an undeclared `SettingTypeTag` value for `Map`. Fast-test build report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/101783/463edd23905aa45f867156b09f6e7e3eadc02666/fast_test/build_clickhouse/build_clickhouse.log Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…actory time Addresses review feedback on PR ClickHouse#101783: * ClickHouse#101783 (comment) ("Don't use mini-languages inside strings. Instead, use structured data like Map.") * ClickHouse#101783 (comment) (datalake column-id mapping precedence) `FormatSettings::Parquet::column_field_ids` now stores `Int32` directly instead of `String`, and `FormatFactory` parses the `Map` setting at SET time. Integer values can be passed natively (`{'a': 1, 'b': 2}`) instead of encoded as strings; quoted-integer strings are still accepted for compatibility. When a datalake (e.g. Iceberg) writer hands `ParquetBlockOutputFormat` a column-id mapping, that mapping is now authoritative; combining it with `output_format_parquet_column_field_ids` or `output_format_parquet_auto_assign_field_ids` raises `BAD_ARGUMENTS` instead of silently producing a Parquet file whose `field_id`s no longer match table metadata. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses ClickHouse#101783 (comment): the previous SQL test only round-tripped values, so a regression that dropped `field_id` metadata from the Parquet schema would not have been caught. The test is now a shell script that uses pyarrow to read the written Parquet schema and print `field_id`s for each column. Covers explicit overrides, auto-assign, mixed override + auto-assign, the default (no `field_id`) path, string-encoded ids for compatibility, and the error cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eld_ids` Required by the `Post Hooks/feature_docs.py` check on PR ClickHouse#101783. CI report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/101783/463edd23905aa45f867156b09f6e7e3eadc02666/finish_workflow/post_hooks/python3%20.%2Fci%2Fjobs%2Fscripts%2Fworkflow_hooks%2Ffeature_docs.py.log Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickHouse's settings parser only accepts string literals as `Map` values
(integer literals fail with `SYNTAX_ERROR`), so the user-visible setting is
`output_format_parquet_column_field_ids = {'col_a': '1', ...}`. Update the
test to use that syntax everywhere and the docs / setting description to
match.
The local C++ side still accepts both `Int64`/`UInt64` (for programmatic
callers) and `String` (for SQL callers), so no implementation change is
needed.
Verified end-to-end with the local build — all five round-trip cases and
all five error cases match the reference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The push to the fork is blocked by the missing `workflow` OAuth scope. Restore `.github/workflows/pull_request.yml` to the pre-merge state so the branch can be pushed; the next maintainer-driven sync to master will pull the latest workflow definition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-field-id # Conflicts: # .github/workflows/pull_request.yml
|
Merged the latest Renumbered the test Rebuilt locally (clean) and ran The CI reds are all unrelated to this Parquet-output-only change: the |
|
Merged latest Rebuilt All 6 review threads are resolved. CI should now run on a current base. |
| {"enable_streaming_queries", false, false, "New setting"}, | ||
| {"optimize_prewhere_after_pushdown", false, false, "New setting that enables a second PREWHERE promotion pass to merge filters deposited above a MergeTree read step by later optimizations (predicate pushdown through JOIN, projection rewrites) into the existing PREWHERE chain."}, | ||
| {"output_format_parquet_column_field_ids", "", "", "New setting to specify explicit Parquet `field_id` overrides per column (`Map(String, Int32)`), useful for Iceberg compatibility."}, | ||
| {"output_format_parquet_auto_assign_field_ids", false, false, "New setting to auto-assign sequential Parquet `field_id`s to every output column, Iceberg-style."}, |
There was a problem hiding this comment.
Why not true by default?
Also a question - if we don't auto-assign field ids, what is wrong with that? - why field ids are even needed?
| {"show_remote_databases_in_system_tables", false, false, "Renamed from `show_data_lake_catalogs_in_system_tables` and broadened to also hide `MySQL` and `PostgreSQL` databases from `system.tables`, `system.columns` and `system.completions` by default, since enumerating their tables requires expensive remote calls. Users who relied on the previous behavior must set this setting to `true`. The old name is kept as an alias."}, | ||
| {"enable_streaming_queries", false, false, "New setting"}, | ||
| {"optimize_prewhere_after_pushdown", false, false, "New setting that enables a second PREWHERE promotion pass to merge filters deposited above a MergeTree read step by later optimizations (predicate pushdown through JOIN, projection rewrites) into the existing PREWHERE chain."}, | ||
| {"output_format_parquet_column_field_ids", "", "", "New setting to specify explicit Parquet `field_id` overrides per column (`Map(String, Int32)`), useful for Iceberg compatibility."}, |
There was a problem hiding this comment.
What is the scenario when a user needs to assign field ids manually? It sounds cumbersome.
|
|
||
| if (const auto * array_type = typeid_cast<const DataTypeArray *>(inner.get())) | ||
| { | ||
| enumerateFieldPaths(name + ".element", array_type->getNestedType(), out); |
There was a problem hiding this comment.
I don't like this. Here we introduced a new notation for almost no reason. If there is no better way, I will close this PR and withdraw the task.
There was a problem hiding this comment.
But if we are okay with always assign field ids automatically, we can remove this code.
| else if (const auto * tuple_type = typeid_cast<const DataTypeTuple *>(inner.get())) | ||
| { | ||
| for (size_t i = 0; i < tuple_type->getElements().size(); ++i) | ||
| enumerateFieldPaths(name + "." + tuple_type->getNameByPosition(i + 1), tuple_type->getElement(i), out); |
alexey-milovidov
left a comment
There was a problem hiding this comment.
Other than the manual assign feature, all ok.
|
Ok. I understand why manual assignment is needed. |
|
If we make substream names exactly as they are in ClickHouse (so we don't invent anything for this task), it will be good. |
|
Apparently, the substream names are this way in Parquet :( |
| namespace | ||
| { | ||
|
|
||
| /// Whether `prepareGeoColumn` will collapse this top-level column into a single WKB `String` |
There was a problem hiding this comment.
Why is the code complex?
`convertSchema` validated the user-supplied Iceberg `field_id` map against the
original ClickHouse column type, but for a recognized geo custom type (`Point`,
`LineString`, `Polygon`, `MultiLineString`, `MultiPolygon`, `Geometry`)
`prepareGeoColumn` later collapses the column into a single WKB `String` field
when `output_format_parquet_geometadata` is enabled (the default). So
`validateIcebergFieldIds` walked the pre-WKB `Tuple` / `Array` shape and demanded
nested `field_id`s (e.g. `point.x`, `point.y`) that the writer never emits and
that `buildColumnFieldIds` never produces — wrongly rejecting a valid top-level
override and breaking auto-assign for geo columns.
For example, `SELECT (1.0, 2.0)::Point AS point SETTINGS
output_format_parquet_column_field_ids = {'point': '7'}` (or auto-assign) failed
with `INCORRECT_DATA` even though the emitted Parquet schema has only the
top-level `point` WKB field.
Fix: share the geo-scalar rule between the `field_id` builder and the validator.
`isGeoColumnWrittenAsWKBScalar` moves from an anonymous namespace in
`ParquetBlockOutputFormat.cpp` to `DB::Parquet` (declared in `Write.h`), and
`convertSchema` now validates a geo-WKB-scalar column as a single top-level field
instead of recursing into its nested shape — matching the schema actually
written and the map `buildColumnFieldIds` builds.
CI report (before fix): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=8ef07d4eb682a8ca2c29f3d81fc9d2ba557df4db
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| /// When a datalake (e.g. Iceberg) writer hands us a column-id mapping, that mapping is | ||
| /// the source of truth — letting session settings override it would produce Parquet files | ||
| /// whose `field_id`s no longer match the table metadata, breaking subsequent reads. | ||
| if (has_metadata_mapping && user_set_field_ids) |
There was a problem hiding this comment.
This guard protects the important invariant that datalake metadata remains the source of truth for field_ids, but I don't see regression coverage for it. The new 04321_parquet_column_field_ids test only writes direct Parquet files, so removing this check would still leave the current test green while reintroducing the path where session settings can override an Iceberg table's column-id mapping.
Please add a focused datalake/Iceberg write case that has format_filter_info_->column_mapper present and verifies that either output_format_parquet_column_field_ids or output_format_parquet_auto_assign_field_ids fails with BAD_ARGUMENTS.
There was a problem hiding this comment.
The guard at ParquetBlockOutputFormat.cpp:195 fires only when format_filter_info_->column_mapper is set, which happens exclusively when writing through a datalake (Iceberg) storage that provides its own column-id mapping. The stateless test writes via file() / clickhouse-local, which never has a column_mapper, so the guard is not reachable there. Covering it needs the Iceberg integration-test harness (write to a real datalake table with output_format_parquet_column_field_ids / output_format_parquet_auto_assign_field_ids set and assert BAD_ARGUMENTS); will add it as a follow-up integration test rather than expanding the stateless test.
There was a problem hiding this comment.
This still doesn't cover the guard at ParquetBlockOutputFormat.cpp:195.
tests/integration/test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py:50-61 sets output_format_parquet_auto_assign_field_ids / output_format_parquet_column_field_ids on the INSERT, but StorageObjectStorage freezes FormatSettings for engine-backed Iceberg tables at CREATE TABLE time (src/Storages/ObjectStorage/registerStorageObjectStorage.cpp:54-69) and then reuses those stored settings on write (src/Storages/ObjectStorage/StorageObjectStorage.cpp:630-631). Because the per-INSERT settings never reach MultipleFileWriter/ParquetBlockOutputFormat, both inserts bypass this guard and the test still doesn't prove the invariant.
To make this a real regression test, either put the setting on the table definition (CREATE TABLE ... SETTINGS ..., via additional_settings in create_iceberg_table) or write through the icebergLocal(...) table function, where query-time format settings are resolved on the write path.
…sHistory
After merging `master`, the `02995_new_settings_history` baseline was bumped
to 26.6, so settings recorded in the `26.6` block of `SettingsChangesHistory.cpp`
are now treated as part of the released baseline rather than as new in the
current development cycle. As a result, the check started reporting:
PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: output_format_parquet_column_field_ids WAS ADDED
PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: output_format_parquet_auto_assign_field_ids WAS ADDED
Move `output_format_parquet_column_field_ids` and
`output_format_parquet_auto_assign_field_ids` from the `26.6` block to the
`26.7` block so they are recognized as newly added settings, fixing the
`02995_new_settings_history` check.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=621a9fdc7750a62bd2383c8b9c8e319ccbb31e28&name_0=PR&name_1=Fast%20test
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d_ids
Address review feedback: the `output_format_parquet_column_field_ids` docs
row described explicit overrides but did not state that, when
`output_format_parquet_auto_assign_field_ids` is disabled, a non-empty map
must cover every output column and every nested field. As written, a user
could try a partial map like `{'a': '1'}` and only discover at runtime that
`buildColumnFieldIds` raises `BAD_ARGUMENTS`. Spell out the full-coverage
requirement and point to `output_format_parquet_auto_assign_field_ids` as the
way to fill the remaining paths automatically.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed
The two remaining AI-review findings on the datalake path are addressed in their threads:
The previous CI reds are unrelated, known master-wide flakes (not touched by this Parquet-output change):
CI re-runs on the fresh base after this push. Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=latest&name_0=PR |
`ParquetBlockOutputFormat` rejects `output_format_parquet_column_field_ids` / `output_format_parquet_auto_assign_field_ids` when a datalake writer (Iceberg) supplies its own column-id mapping via `format_filter_info->column_mapper`, because that mapping is the authoritative source for the Parquet `field_id`s and letting the session settings override it would emit `field_id`s that no longer match the table metadata. That guard is not reachable from the stateless `04321_parquet_column_field_ids` test, which writes via `file` / `clickhouse-local` and therefore never has a `column_mapper`. This adds a focused integration test in `test_storage_iceberg_no_spark` that writes to an Iceberg table with each setting enabled and asserts the write fails with `BAD_ARGUMENTS`, so removing the guard would no longer leave CI green. Addresses the AI Review "Request changes" verdict and the review thread on `ParquetBlockOutputFormat.cpp` asking for coverage of this guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Added the datalake-guard regression test that the AI Review "Request changes" verdict (and the New test
This closes the gap the stateless CI: the only red on the prior commit was The remaining open threads are maintainer design calls (the manual-assign rationale; the substream-name notation, which matches Parquet's own; and the pre-existing dotted-path behavior on the datalake metadata path) — left for @alexey-milovidov. |
|
Heads-up on the two CI reds on 1. The new datalake-guard test
The practical consequence is larger than the test: these settings can only reach an Iceberg write through the table's The fix direction depends on the design call in the open threads:
I left it rather than rewrite the test, since it's tied to that decision rather than a mechanical fix. 2. |
|
Merged The only PR-caused red is the new Root cause of the test failure — the guard is unreachable on the path the test uses. The test creates a table engine ( A corollary that's directly relevant to the still-open manual-assign discussion: on the natural Two ways to make the test actually exercise the guard, if we keep it:
I left the fix to you: whether to keep the guard at all is entangled with the manual-assign notation thread you're still deciding, and I couldn't run the integration suite in this environment to confirm which of the two rewrites is cleanest for the |
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 144/176 (81.82%) · Uncovered code |

What this PR does
Two related knobs for writing Parquet files with column
field_idmetadata — needed for Apache Iceberg compatibility, which identifies columns byfield_idrather than by name.Both knobs can be used independently or together.
1. Explicit per-column overrides
New setting
output_format_parquet_column_field_ids : Map(String, Int32).Each output column gets the id specified in the map. The map must cover every output column (unless auto-assign below is enabled to fill the gaps), and ids must be unique; violations raise
BAD_ARGUMENTSwith specific messages.2. Auto-assign (Iceberg writer convention)
New setting
output_format_parquet_auto_assign_field_ids : Bool, defaultfalse.When enabled, every output column is assigned a unique sequential
field_idstarting at1, matching the convention used by Apache Iceberg writers.3. Mixed mode
The two settings compose: override-map entries win for the columns they mention; auto-assign fills the remaining columns with the smallest unused positive ids.
Implementation
src/Core/FormatFactorySettings.h— declare both settings.src/Formats/FormatSettings.h— carry astd::vector<std::pair<String, Int32>>of overrides and theauto_assign_field_idsbool; no string parsing on the hot path.src/Formats/FormatFactory.cpp— convert theMapsetting into the pair vector, with structural validation (tuple shape, string keys, integer values inInt32range).src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp—buildColumnFieldIds()resolves overrides against the actual output header, auto-assigns unused ids if the flag is on, and rejects unknown columns, duplicate ids, and non-covering overrides.Test
tests/queries/0_stateless/04321_parquet_column_field_ids.shwrites Parquet files withclickhouse-local, reads thefield_idmetadata back withpyarrow, and covers:field_ids, unchanged behavior).Array.element,Tuplesubfields, andMap.key/Map.value.Point/Array/Tupleshape collapses to a single top-level field).Closes: #58753
Changelog category (leave one):
Changelog entry:
Added
output_format_parquet_column_field_ids(Map(String, Int32)) to set explicit Parquetfield_idoverrides per column andoutput_format_parquet_auto_assign_field_ids(Bool) to auto-assign sequentialfield_ids to every output column, matching the Apache Iceberg writer convention. The two settings can be combined.Documentation entry for user-facing changes
Note
Medium Risk
Touches the Parquet write path and schema generation to emit
field_idmetadata, which could affect interoperability and schema expectations; defaults are preserved unless new settings are enabled.Overview
Adds two new Parquet output settings to control column
field_idmetadata for Iceberg compatibility:output_format_parquet_column_field_ids(explicit per-column overrides) andoutput_format_parquet_auto_assign_field_ids(sequential auto-assignment starting at 1).FormatFactorynow parses/validates theMapsetting into typed overrides up-front, whileParquetBlockOutputFormatresolves the finalfield_idmapping (including error checks for unknown columns, duplicates, negative ids, and incomplete coverage when auto-assign is off) and explicitly rejects using these settings when a datalake writer provides its own column-id mapping.Documentation and a new stateless test are added to verify correct
field_idemission viapyarrowand expected failure modes.Reviewed by Cursor Bugbot for commit ce5f371. Bugbot is set up for automated code reviews on this repo. Configure here.