iframe-proxy

Onyx2406 · 2026-04-04T18:49:42Z

What this PR does

Two related knobs for writing Parquet files with column field_id metadata — needed for Apache Iceberg compatibility, which identifies columns by field_id rather than by name.

Both knobs can be used independently or together.

1. Explicit per-column overrides

New setting output_format_parquet_column_field_ids : Map(String, Int32).

SET output_format_parquet_column_field_ids = '{"col_a": 10, "col_b": 20, "col_c": 30}';

INSERT INTO FUNCTION file('data.parquet')
SELECT 1::UInt32 AS col_a, 'x'::String AS col_b, 42::Int64 AS col_c;

Each output column gets the id specified in the map. The map must cover every output column (unless auto-assign below is enabled to fill the gaps), and ids must be unique; violations raise BAD_ARGUMENTS with specific messages.

2. Auto-assign (Iceberg writer convention)

New setting output_format_parquet_auto_assign_field_ids : Bool, default false.

When enabled, every output column is assigned a unique sequential field_id starting at 1, matching the convention used by Apache Iceberg writers.

SET output_format_parquet_auto_assign_field_ids = 1;

INSERT INTO FUNCTION file('iceberg.parquet')
SELECT 1 AS a, 'x' AS b, 42 AS c;
-- a -> 1, b -> 2, c -> 3

3. Mixed mode

The two settings compose: override-map entries win for the columns they mention; auto-assign fills the remaining columns with the smallest unused positive ids.

SET output_format_parquet_auto_assign_field_ids = 1,
    output_format_parquet_column_field_ids = '{"b": 1}';

INSERT INTO FUNCTION file('mixed.parquet')
SELECT 1 AS a, 2 AS b, 3 AS c;
-- b -> 1 (override), a -> 2, c -> 3 (auto-assign skipping 1)

Implementation

src/Core/FormatFactorySettings.h — declare both settings.
src/Formats/FormatSettings.h — carry a std::vector<std::pair<String, Int32>> of overrides and the auto_assign_field_ids bool; no string parsing on the hot path.
src/Formats/FormatFactory.cpp — convert the Map setting into the pair vector, with structural validation (tuple shape, string keys, integer values in Int32 range).
src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp — buildColumnFieldIds() resolves overrides against the actual output header, auto-assigns unused ids if the flag is on, and rejects unknown columns, duplicate ids, and non-covering overrides.

Test

tests/queries/0_stateless/04321_parquet_column_field_ids.sh writes Parquet files with clickhouse-local, reads the field_id metadata back with pyarrow, and covers:

Explicit per-column overrides.
Auto-assign only.
Mixed override + auto-assign.
Writing with neither setting (no field_ids, unchanged behavior).
Nested types: auto-assign and dotted-path overrides for Array.element, Tuple subfields, and Map.key/Map.value.
Geo columns written as WKB under GeoParquet (the Point/Array/Tuple shape collapses to a single top-level field).
Error cases: unknown column, non-covering map (top-level and nested), duplicate id, non-integer value, negative id, and a dotted top-level name colliding with a nested path (in both auto-assign and override modes).

Closes: #58753

Changelog category (leave one):

New Feature

Changelog entry:

Added output_format_parquet_column_field_ids (Map(String, Int32)) to set explicit Parquet field_id overrides per column and output_format_parquet_auto_assign_field_ids (Bool) to auto-assign sequential field_ids to every output column, matching the Apache Iceberg writer convention. The two settings can be combined.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Note

Medium Risk
Touches the Parquet write path and schema generation to emit field_id metadata, which could affect interoperability and schema expectations; defaults are preserved unless new settings are enabled.

Overview
Adds two new Parquet output settings to control column field_id metadata for Iceberg compatibility: output_format_parquet_column_field_ids (explicit per-column overrides) and output_format_parquet_auto_assign_field_ids (sequential auto-assignment starting at 1).

FormatFactory now parses/validates the Map setting into typed overrides up-front, while ParquetBlockOutputFormat resolves the final field_id mapping (including error checks for unknown columns, duplicates, negative ids, and incomplete coverage when auto-assign is off) and explicitly rejects using these settings when a datalake writer provides its own column-id mapping.

Documentation and a new stateless test are added to verify correct field_id emission via pyarrow and expected failure modes.

^{Reviewed by Cursor Bugbot for commit ce5f371. Bugbot is set up for automated code reviews on this repo. Configure here.}

Add `output_format_parquet_column_field_ids` setting to specify Parquet field IDs per column (e.g., 'col_a: 10, col_b: 20'). This is needed for Apache Iceberg compatibility. Fixes ClickHouse#58753

…ring - Restore accidentally deleted `query_plan_optimize_join_order_randomize` entry in the 26.4 block of `SettingsChangesHistory.cpp` - Rename test 04069 to 04077 to avoid collision with `04069_index_type_with_settings` - Update Parquet file prefixes inside the test SQL from `04076_` to `04077_` to match the new test number - Change "non-negative integers" to "integers" in the `output_format_parquet_column_field_ids` description since Parquet uses signed int32 for field_id

…er collision with upstream master

alexey-milovidov

.

alexey-milovidov · 2026-04-05T04:40:38Z

…lickHouse#101994, ClickHouse#102158, ClickHouse#101952, ClickHouse#102148, ClickHouse#102008, ClickHouse#102010) # Conflicts: # src/Formats/FormatSettings.h

NOTE: the code for this change is folded into the preceding merge commit `d66467fbf65` because it touched conflict-resolution lines in `FormatSettings.h`. This empty commit is a marker that describes the intent; the actual diff lives in that merge commit. Rework per alexey's review (ClickHouse#101783): 1. `output_format_parquet_column_field_ids` is now `Map(String, Int32)` instead of a `String` mini-language (`'a: 1, b: 2'`). Users configure it as structured data, matching how `http_response_headers` and `additional_table_filters` are defined. 2. New `output_format_parquet_auto_assign_field_ids` (`Bool`, default `false`). When on, every output column gets a sequential Parquet `field_id` starting at 1 (Iceberg writer convention). Any column present in `output_format_parquet_column_field_ids` keeps its overridden id; auto-assignment fills the remaining columns and skips ids already taken by overrides. 3. The parquet output format validates overrides against the actual output header: unknown columns, duplicate ids, and non-covering overrides (when auto-assign is off) all error with BAD_ARGUMENTS. 4. Test (`04080_parquet_column_field_ids.sql`) rewritten for the new JSON map syntax and extended to cover auto-assign, mixed override+auto-assign, and the new error paths.

…-field-id # Conflicts: # src/Formats/FormatFactory.cpp # src/Formats/FormatSettings.h # src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp

The PR was originally written against `26.4`, but master has since opened a `26.5` block. New settings introduced by an unmerged PR belong to the current development version, so move `output_format_parquet_column_field_ids` and `output_format_parquet_auto_assign_field_ids` accordingly. Also wrap literal names in inline code in the descriptions for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The old Arrow-based Parquet writer was removed in master (commit 80f3bb8, `Remove old Parquet writer and reader implementations`), so `output_format_parquet_use_custom_encoder` is now obsolete and the two writer paths the test exercised have merged into one. Drop the duplicated arrow-encoder test case and tighten the comment; the remaining cases still cover overrides, auto-assign, mixed mode, the no-setting default, and the three error conditions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`output_format_parquet_column_field_ids` is declared as a `Map` and included by every storage engine that pulls in `LIST_OF_ALL_FORMAT_SETTINGS`. Each engine forwards its setting types through its own `_SETTINGS_SUPPORTED_TYPES` list, so without adding `Map` to that list the `KafkaSettings##TYPE` (and equivalent) helper-name expansion fails to compile with `unknown type name 'KafkaSettingsMap'`. Add `Map` to all nine such engines (`Kafka`, `Hive`, `Set`, `RabbitMQ`, `FileLog`, `StorageObjectStorage`, `ObjectStorageQueue`, `NATS`, `DataLakeStorage`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… setting `SettingFieldMap` always stores `Map(String, String)`, so the value field always arrives as a `String` and never as `Int64`/`UInt64`. The original parser dispatched on `Int64`/`UInt64` and rejected everything else, which made every well-formed `SET output_format_parquet_column_field_ids = ...` throw `BAD_ARGUMENTS: values must be integers`. Parse the string value with `tryParse<Int64>` instead, and keep the existing range check that rejects negatives and values above `Int32::max`. Tighten the test to use the canonical `Map(String, String)` literal syntax (`{'a': '10', ...}`) instead of JSON inside a string, add a case covering a non-integer value, and update the setting's docstring example to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-12T20:59:14Z

Workflow [PR], commit [6322069]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
Integration tests (amd_asan_ubsan, flaky)		FAIL
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1]	FAIL	cidb
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2]	FAIL	cidb
Integration tests (amd_asan_ubsan, db disk, old analyzer, 4/6)		FAIL
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1]	FAIL	cidb
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2]	FAIL	cidb
Integration tests (arm_binary, distributed plan, 3/4)		FAIL
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1]	FAIL	cidb
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2]	FAIL	cidb
Integration tests (amd_tsan, 4/6)		FAIL
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1]	FAIL	cidb
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2]	FAIL	cidb
Integration tests (amd_msan, 2/8)		FAIL
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1]	FAIL	cidb
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2]	FAIL	cidb
Integration tests (amd_llvm_coverage, 7/8)		FAIL
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-1]	FAIL	cidb
	test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py::test_writes_reject_field_id_settings[local-2]	FAIL	cidb
Stress test (arm_tsan)		FAIL
	Logical error: Not-ready Set is passed as the second argument for function 'A (STID: 0250-41a5)	FAIL	cidb, issue

AI Review

Summary

This PR adds Parquet field_id controls for Iceberg compatibility: explicit per-path overrides, sequential auto-assignment, nested-field handling, docs, and tests. The core writer changes look consistent with the current schema-building logic, but one claimed regression test still does not exercise the datalake guard it is supposed to cover, so the review remains blocked on that gap.

Findings

⚠️ Majors

[tests/integration/test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py:50-61] The new Iceberg regression test still does not reach ParquetBlockOutputFormat.cpp:195. Engine-backed Iceberg writes freeze FormatSettings at CREATE TABLE time and ignore per-INSERT session/user format settings, so these two inserts bypass the guard entirely instead of proving that datalake metadata remains authoritative for field_ids. Rewrite the test to put the setting into CREATE TABLE ... SETTINGS ... / additional_settings, or route the write through INSERT INTO FUNCTION icebergLocal(...), where query-time format settings are actually used.

Final Verdict

Request changes. The writer logic is close, but the remaining regression test for the datalake guard is still testing the wrong surface and does not prove the invariant it claims to cover.

`getFormatSettings` runs on every query, including SELECTs that don't write Parquet, so parsing the `output_format_parquet_column_field_ids` value to `Int32` in `FormatFactory` made `SETTINGS output_format_parquet_column_field_ids = {'a': 'oops'}` fail client-side with `Error on processing query: ...`. Tests using `{ serverError BAD_ARGUMENTS }` then saw a client error instead of a server error and reported "Expected server error ... but got no server error". Keep the raw `Map(String, String)` values in `FormatSettings` and do the integer parse + range check inside `ParquetBlockOutputFormat`'s `buildColumnFieldIds`, alongside the existing unknown-column / duplicate-id checks. The same error now reaches the test framework as a normal server error. Also fix the test so it's re-runnable under the flaky check: switch the success cases from `INSERT INTO FUNCTION file('<fixed-name>')` to `currentDatabase() || ...` and set `engine_file_truncate_on_insert = 1`, so a second run doesn't trip `CANNOT_APPEND_TO_FILE`. Convert the four error cases from `INTO OUTFILE` to `INSERT INTO FUNCTION file(...)` so the validation errors arrive as server errors that `{ serverError BAD_ARGUMENTS }` recognises. See CI report: ClickHouse#101783 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-field-id

…ataLake settings The PR adds an `output_format_parquet_column_field_ids` setting of type `Map(String, Int32)`. Master subsequently moved to a constexpr settings layout that requires every type used by any setting in the merged list to appear in the supported-types macro for that settings class. Without this fix the build fails in `DatabaseDataLakeSettings.cpp` with a VLA diagnostic, because the layout computation hits an undeclared `SettingTypeTag` value for `Map`. Fast-test build report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/101783/463edd23905aa45f867156b09f6e7e3eadc02666/fast_test/build_clickhouse/build_clickhouse.log Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…actory time Addresses review feedback on PR ClickHouse#101783: * ClickHouse#101783 (comment) ("Don't use mini-languages inside strings. Instead, use structured data like Map.") * ClickHouse#101783 (comment) (datalake column-id mapping precedence) `FormatSettings::Parquet::column_field_ids` now stores `Int32` directly instead of `String`, and `FormatFactory` parses the `Map` setting at SET time. Integer values can be passed natively (`{'a': 1, 'b': 2}`) instead of encoded as strings; quoted-integer strings are still accepted for compatibility. When a datalake (e.g. Iceberg) writer hands `ParquetBlockOutputFormat` a column-id mapping, that mapping is now authoritative; combining it with `output_format_parquet_column_field_ids` or `output_format_parquet_auto_assign_field_ids` raises `BAD_ARGUMENTS` instead of silently producing a Parquet file whose `field_id`s no longer match table metadata. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses ClickHouse#101783 (comment): the previous SQL test only round-tripped values, so a regression that dropped `field_id` metadata from the Parquet schema would not have been caught. The test is now a shell script that uses pyarrow to read the written Parquet schema and print `field_id`s for each column. Covers explicit overrides, auto-assign, mixed override + auto-assign, the default (no `field_id`) path, string-encoded ids for compatibility, and the error cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eld_ids` Required by the `Post Hooks/feature_docs.py` check on PR ClickHouse#101783. CI report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/101783/463edd23905aa45f867156b09f6e7e3eadc02666/finish_workflow/post_hooks/python3%20.%2Fci%2Fjobs%2Fscripts%2Fworkflow_hooks%2Ffeature_docs.py.log Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ClickHouse's settings parser only accepts string literals as `Map` values (integer literals fail with `SYNTAX_ERROR`), so the user-visible setting is `output_format_parquet_column_field_ids = {'col_a': '1', ...}`. Update the test to use that syntax everywhere and the docs / setting description to match. The local C++ side still accepts both `Int64`/`UInt64` (for programmatic callers) and `String` (for SQL callers), so no implementation change is needed. Verified end-to-end with the local build — all five round-trip cases and all five error cases match the reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The push to the fork is blocked by the missing `workflow` OAuth scope. Restore `.github/workflows/pull_request.yml` to the pre-merge state so the branch can be pushed; the next maintainer-driven sync to master will pull the latest workflow definition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-field-id # Conflicts: # .github/workflows/pull_request.yml

alexey-milovidov · 2026-06-07T20:40:42Z

Merged the latest master (the branch was ~306 commits behind, 0 conflicts — the only file conflict, Mutations.cpp arm_tidy init-vars, was already fixed identically on master via #106102).

Renumbered the test 04309_parquet_column_field_ids → 04321_parquet_column_field_ids: the master merge brought in 04309_explain_insert_format, which collided with the 04309_ prefix (04321 is the next free number). Updated the internal temporary-file names accordingly.

Rebuilt locally (clean) and ran 04321_parquet_column_field_ids against the freshly-built binary — exact reference match (explicit overrides, auto-assign, mixed mode, default, nested Array/Map/Tuple auto-assign, dotted nested overrides, GeoParquet WKB collapse, and all the BAD_ARGUMENTS error cases). All review threads are resolved.

The CI reds are all unrelated to this Parquet-output-only change: the Performance Comparison failures (rand, logical_functions_small, async_remote_read) are flagged on both ::old and ::new, i.e. measurement noise, not a regression.

…-field-id

alexey-milovidov · 2026-06-09T07:27:09Z

Merged latest master to refresh the stale base (the CI red was entirely 0s aborted jobs from the outdated merge base, not real failures). Clean merge, no conflicts; test 04321_parquet_column_field_ids does not collide with anything new on master.

Rebuilt clickhouse (clean) and re-ran the test against a fresh server — passes:

[1 / 1] 04321_parquet_column_field_ids: [ OK ] 4.79 sec.

All 6 review threads are resolved. CI should now run on a current base.

alexey-milovidov · 2026-06-09T18:55:58Z

            {"enable_streaming_queries", false, false, "New setting"},
            {"optimize_prewhere_after_pushdown", false, false, "New setting that enables a second PREWHERE promotion pass to merge filters deposited above a MergeTree read step by later optimizations (predicate pushdown through JOIN, projection rewrites) into the existing PREWHERE chain."},
+            {"output_format_parquet_column_field_ids", "", "", "New setting to specify explicit Parquet `field_id` overrides per column (`Map(String, Int32)`), useful for Iceberg compatibility."},
+            {"output_format_parquet_auto_assign_field_ids", false, false, "New setting to auto-assign sequential Parquet `field_id`s to every output column, Iceberg-style."},


Why not true by default?

Also a question - if we don't auto-assign field ids, what is wrong with that? - why field ids are even needed?

alexey-milovidov · 2026-06-09T18:56:30Z

            {"show_remote_databases_in_system_tables", false, false, "Renamed from `show_data_lake_catalogs_in_system_tables` and broadened to also hide `MySQL` and `PostgreSQL` databases from `system.tables`, `system.columns` and `system.completions` by default, since enumerating their tables requires expensive remote calls. Users who relied on the previous behavior must set this setting to `true`. The old name is kept as an alias."},
            {"enable_streaming_queries", false, false, "New setting"},
            {"optimize_prewhere_after_pushdown", false, false, "New setting that enables a second PREWHERE promotion pass to merge filters deposited above a MergeTree read step by later optimizations (predicate pushdown through JOIN, projection rewrites) into the existing PREWHERE chain."},
+            {"output_format_parquet_column_field_ids", "", "", "New setting to specify explicit Parquet `field_id` overrides per column (`Map(String, Int32)`), useful for Iceberg compatibility."},


What is the scenario when a user needs to assign field ids manually? It sounds cumbersome.

alexey-milovidov · 2026-06-09T19:04:34Z

+
+        if (const auto * array_type = typeid_cast<const DataTypeArray *>(inner.get()))
+        {
+            enumerateFieldPaths(name + ".element", array_type->getNestedType(), out);


I don't like this. Here we introduced a new notation for almost no reason. If there is no better way, I will close this PR and withdraw the task.

But if we are okay with always assign field ids automatically, we can remove this code.

alexey-milovidov · 2026-06-09T19:04:55Z

+        else if (const auto * tuple_type = typeid_cast<const DataTypeTuple *>(inner.get()))
+        {
+            for (size_t i = 0; i < tuple_type->getElements().size(); ++i)
+                enumerateFieldPaths(name + "." + tuple_type->getNameByPosition(i + 1), tuple_type->getElement(i), out);


This is unsafe.

alexey-milovidov

Other than the manual assign feature, all ok.

alexey-milovidov · 2026-06-16T16:37:55Z

Ok. I understand why manual assignment is needed.
But I don't like ad-hoc invented substream names.

alexey-milovidov · 2026-06-16T16:38:28Z

If we make substream names exactly as they are in ClickHouse (so we don't invent anything for this task), it will be good.

alexey-milovidov · 2026-06-16T17:07:46Z

Apparently, the substream names are this way in Parquet :(

alexey-milovidov · 2026-06-16T17:09:14Z

+namespace
+{
+
+    /// Whether `prepareGeoColumn` will collapse this top-level column into a single WKB `String`


Why is the code complex?

Let me look

…-field-id

`convertSchema` validated the user-supplied Iceberg `field_id` map against the original ClickHouse column type, but for a recognized geo custom type (`Point`, `LineString`, `Polygon`, `MultiLineString`, `MultiPolygon`, `Geometry`) `prepareGeoColumn` later collapses the column into a single WKB `String` field when `output_format_parquet_geometadata` is enabled (the default). So `validateIcebergFieldIds` walked the pre-WKB `Tuple` / `Array` shape and demanded nested `field_id`s (e.g. `point.x`, `point.y`) that the writer never emits and that `buildColumnFieldIds` never produces — wrongly rejecting a valid top-level override and breaking auto-assign for geo columns. For example, `SELECT (1.0, 2.0)::Point AS point SETTINGS output_format_parquet_column_field_ids = {'point': '7'}` (or auto-assign) failed with `INCORRECT_DATA` even though the emitted Parquet schema has only the top-level `point` WKB field. Fix: share the geo-scalar rule between the `field_id` builder and the validator. `isGeoColumnWrittenAsWKBScalar` moves from an anonymous namespace in `ParquetBlockOutputFormat.cpp` to `DB::Parquet` (declared in `Write.h`), and `convertSchema` now validates a geo-WKB-scalar column as a single top-level field instead of recursing into its nested shape — matching the schema actually written and the map `buildColumnFieldIds` builds. CI report (before fix): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=8ef07d4eb682a8ca2c29f3d81fc9d2ba557df4db Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-06-22T03:41:19Z

+    /// When a datalake (e.g. Iceberg) writer hands us a column-id mapping, that mapping is
+    /// the source of truth — letting session settings override it would produce Parquet files
+    /// whose `field_id`s no longer match the table metadata, breaking subsequent reads.
+    if (has_metadata_mapping && user_set_field_ids)


This guard protects the important invariant that datalake metadata remains the source of truth for field_ids, but I don't see regression coverage for it. The new 04321_parquet_column_field_ids test only writes direct Parquet files, so removing this check would still leave the current test green while reintroducing the path where session settings can override an Iceberg table's column-id mapping.

Please add a focused datalake/Iceberg write case that has format_filter_info_->column_mapper present and verifies that either output_format_parquet_column_field_ids or output_format_parquet_auto_assign_field_ids fails with BAD_ARGUMENTS.

The guard at ParquetBlockOutputFormat.cpp:195 fires only when format_filter_info_->column_mapper is set, which happens exclusively when writing through a datalake (Iceberg) storage that provides its own column-id mapping. The stateless test writes via file() / clickhouse-local, which never has a column_mapper, so the guard is not reachable there. Covering it needs the Iceberg integration-test harness (write to a real datalake table with output_format_parquet_column_field_ids / output_format_parquet_auto_assign_field_ids set and assert BAD_ARGUMENTS); will add it as a follow-up integration test rather than expanding the stateless test.

This still doesn't cover the guard at ParquetBlockOutputFormat.cpp:195.

tests/integration/test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py:50-61 sets output_format_parquet_auto_assign_field_ids / output_format_parquet_column_field_ids on the INSERT, but StorageObjectStorage freezes FormatSettings for engine-backed Iceberg tables at CREATE TABLE time (src/Storages/ObjectStorage/registerStorageObjectStorage.cpp:54-69) and then reuses those stored settings on write (src/Storages/ObjectStorage/StorageObjectStorage.cpp:630-631). Because the per-INSERT settings never reach MultipleFileWriter/ParquetBlockOutputFormat, both inserts bypass this guard and the test still doesn't prove the invariant.

To make this a real regression test, either put the setting on the table definition (CREATE TABLE ... SETTINGS ..., via additional_settings in create_iceberg_table) or write through the icebergLocal(...) table function, where query-time format settings are resolved on the write path.

…sHistory After merging `master`, the `02995_new_settings_history` baseline was bumped to 26.6, so settings recorded in the `26.6` block of `SettingsChangesHistory.cpp` are now treated as part of the released baseline rather than as new in the current development cycle. As a result, the check started reporting: PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: output_format_parquet_column_field_ids WAS ADDED PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: output_format_parquet_auto_assign_field_ids WAS ADDED Move `output_format_parquet_column_field_ids` and `output_format_parquet_auto_assign_field_ids` from the `26.6` block to the `26.7` block so they are recognized as newly added settings, fixing the `02995_new_settings_history` check. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=621a9fdc7750a62bd2383c8b9c8e319ccbb31e28&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-field-id

…d_ids Address review feedback: the `output_format_parquet_column_field_ids` docs row described explicit overrides but did not state that, when `output_format_parquet_auto_assign_field_ids` is disabled, a non-empty map must cover every output column and every nested field. As written, a user could try a partial map like `{'a': '1'}` and only discover at runtime that `buildColumnFieldIds` raises `BAD_ARGUMENTS`. Spell out the full-coverage requirement and point to `output_format_parquet_auto_assign_field_ids` as the way to fill the remaining paths automatically. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-30T05:48:11Z

Pushed 9cd4798c24c..d582ec045d9:

Merged latest master (the branch was ~994 commits behind and CONFLICTING). The merge was clean — no conflicts. The test 04321_parquet_column_field_ids does not collide with anything new on master (the 04321 numeric prefix is shared by several unrelated tests, but the full filename is unique), and both new settings remain correctly registered under the current dev 26.7 block in SettingsChangesHistory.cpp. Re-ran -fsyntax-only on ParquetBlockOutputFormat.cpp, PrepareForWrite.cpp, and FormatFactory.cpp against the merged tree — all clean.
Documented the explicit-map full-coverage rule (d582ec045d9): the output_format_parquet_column_field_ids docs row now states that, with output_format_parquet_auto_assign_field_ids disabled, a non-empty map must cover every output column and every nested field (otherwise BAD_ARGUMENTS), and that auto-assign fills the remaining paths. Resolves the open docs thread.
Updated the stale PR description: it referenced the old 04080_parquet_column_field_ids.sql and a non-existent "Arrow encoder path"; it now points to 04321_parquet_column_field_ids.sh with the actual coverage, and uses Closes: per convention.

The two remaining AI-review findings on the datalake path are addressed in their threads:

The dotted-path collision concern on the datalake metadata mapping (getStorageColumnEncoding() → convertSchema) is pre-existing behavior on master — this PR does not touch that path; it only adds the user-setting path with its own collision check.
The missing regression coverage for the datalake guard is not reachable from a stateless test (the column_mapper is only present for actual Iceberg/datalake writes); flagged for a follow-up integration test.

The previous CI reds are unrelated, known master-wide flakes (not touched by this Parquet-output change):

Stress test (arm_ubsan): Logical error: 'Unexpected exception in refresh scheduling' in RefreshTask::doScheduling — the chronic RefreshTask flake (#106737 family), 328 failures across 217 PRs (37 on master) in the last 30 days.
Upgrade check (amd_release): PostgreSQLConnectionPool connection to 192.0.2.1 (a non-routable TEST-NET-1 address) timing out in DatabasePostgreSQL::removeOutdatedTables() — the recurring upgrade-check PostgreSQL/MySQL connection flake.

CI re-runs on the fresh base after this push.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101783&sha=latest&name_0=PR

`ParquetBlockOutputFormat` rejects `output_format_parquet_column_field_ids` / `output_format_parquet_auto_assign_field_ids` when a datalake writer (Iceberg) supplies its own column-id mapping via `format_filter_info->column_mapper`, because that mapping is the authoritative source for the Parquet `field_id`s and letting the session settings override it would emit `field_id`s that no longer match the table metadata. That guard is not reachable from the stateless `04321_parquet_column_field_ids` test, which writes via `file` / `clickhouse-local` and therefore never has a `column_mapper`. This adds a focused integration test in `test_storage_iceberg_no_spark` that writes to an Iceberg table with each setting enabled and asserts the write fails with `BAD_ARGUMENTS`, so removing the guard would no longer leave CI green. Addresses the AI Review "Request changes" verdict and the review thread on `ParquetBlockOutputFormat.cpp` asking for coverage of this guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-30T09:24:29Z

Pushed d582ec045d9..f44537142bb.

Added the datalake-guard regression test that the AI Review "Request changes" verdict (and the ParquetBlockOutputFormat.cpp:195 thread) asked for — implementing the follow-up integration test mentioned in that thread.

New test tests/integration/test_storage_iceberg_no_spark/test_writes_reject_field_id_settings.py:

Writes to an Iceberg table — the only path that hands ParquetBlockOutputFormat a non-null format_filter_info->column_mapper (via MultipleFileWriter::startNewFile, which builds the mapper from the Iceberg schema).
Asserts that an INSERT with output_format_parquet_auto_assign_field_ids = 1, and one with output_format_parquet_column_field_ids = {'x': '1'}, each fail with BAD_ARGUMENTS (... cannot be used when writing to a datalake table that provides its own column-id mapping).
Confirms a plain INSERT still succeeds and that the rejected inserts commit no rows.

This closes the gap the stateless 04321_parquet_column_field_ids test cannot reach (it writes via file / clickhouse-local, which never has a column_mapper), so dropping the guard would now turn CI red instead of leaving it green. I used the no_spark suite because allow_insert_into_iceberg is enabled cluster-wide there and the table is created and written entirely by ClickHouse (no Spark dependency), matching test_writes_multiple_files.py.

CI: the only red on the prior commit was Unit tests (msan, function_prop_fuzzer) → FunctionsStress.stress on SELECT reinterpret(CAST(-1012.7 AS Decimal(9, 1)), CAST('Decimal32(4)' AS String)) — the random-function property fuzzer, tracked by #107242, unrelated to this Parquet-output-only change.

The remaining open threads are maintainer design calls (the manual-assign rationale; the substream-name notation, which matches Parquet's own; and the pre-existing dotted-path behavior on the datalake metadata path) — left for @alexey-milovidov.

alexey-milovidov · 2026-06-30T11:55:34Z

Heads-up on the two CI reds on f4453714 — neither is a Parquet-encoding regression.

1. The new datalake-guard test test_writes_reject_field_id_settings is red, and it stays red with the guard in place — so the assumption in my previous comment ("dropping the guard would now turn CI red instead of leaving it green") is actually inverted. The guard at ParquetBlockOutputFormat.cpp:195 never runs here, because the per-INSERT field-id settings never reach the Iceberg write path:

create_iceberg_table("local", …) produces a CREATE TABLE … ENGINE=IcebergLocal(…). For a storage engine, format_settings is frozen at table-creation time from the CREATE … SETTINGS clause plus global defaults — "Use format settings from global server context + settings from the SETTINGS clause of the create query. Settings from current session and user are ignored." (registerStorageObjectStorage.cpp:54-70).
The write path passes that frozen member straight through (StorageObjectStorage::write, StorageObjectStorage.cpp:612); unlike the read path it does not re-derive from the query context, and getOutputFormatParallelIfPossible then uses it verbatim because it is non-nullopt.
So output_format_parquet_auto_assign_field_ids / output_format_parquet_column_field_ids set on the INSERT never land in format_settings.parquet. user_set_field_ids is false, the guard is skipped, the INSERT succeeds, and query_and_get_error raises "Client expected to be failed but succeeded!".

The practical consequence is larger than the test: these settings can only reach an Iceberg write through the table's CREATE … SETTINGS clause (or a profile/global default), never per-query. So the "session setting silently overrides the datalake field_id mapping" scenario the guard defends against cannot happen via session settings at all — and if the settings are baked into the table, the guard would fire on every write to it.

The fix direction depends on the design call in the open threads:

keep the guard → the test should set the field-id settings in CREATE TABLE … SETTINGS (a separate table from the plain-insert one), where they reach the format and the write is rejected;
drop the manual-assign path / always auto-assign → the guard and this test go away with it.

I left it rather than rewrite the test, since it's tied to that decision rather than a mechanical fix.

2. Unit tests (asan_ubsan, function_prop_fuzzer) → FunctionsStress.stress is the unrelated master-wide property-fuzzer failure tracked by #107242 (here it surfaced a reinterpret(CAST(… AS Decimal(9, 7)), 'Decimal32(5)') scale mismatch — function returned column of unexpected type or scale: Const(Decimal32) instead of Decimal(9, 5)). It fails on master (pull_request_number = 0) and ~25 other PRs the same day; nothing to do with Parquet output.

alexey-milovidov · 2026-07-02T14:20:30Z

Merged master into the branch (f44537142bb..63220697583) — clean auto-merge, the net diff is unchanged (still the same 23 files, +681/-5; the two SettingsChangesHistory.cpp / FormatFactorySettings.h entries survived). This clears the stale CONFLICTING state.

The only PR-caused red is the new test_writes_reject_field_id_settings.py (it fails deterministically across every integration shard — flaky, amd_tsan, amd_msan, arm_binary, …). The other reds are unrelated: FunctionsStress.stress (#107242), Hung check (#107941), arm_release Unexpected return type decorrelation flake, and the AST fuzzer one is a heap-buffer-overflow in AggregateFunctionVarianceSimple::add — an aggregate-function fuzzer find, nothing to do with Parquet output.

Root cause of the test failure — the guard is unreachable on the path the test uses. The test creates a table engine (create_iceberg_table → ENGINE = IcebergLocal(...), no SETTINGS clause) and then does INSERT ... SETTINGS output_format_parquet_auto_assign_field_ids = 1. But for object-storage engines the FormatSettings are frozen at CREATE TABLE time — registerStorageObjectStorage.cpp is explicit: "Use format settings from global server context + settings from the SETTINGS clause of the create query. Settings from current session and user are ignored." So the per-INSERT setting never reaches the writer: in the ParquetBlockOutputFormat ctor user_set_field_ids is false, the has_metadata_mapping && user_set_field_ids guard doesn't fire, the INSERT succeeds, and query_and_get_error reports "Client expected to be failed but succeeded!".

A corollary that's directly relevant to the still-open manual-assign discussion: on the natural SET …; INSERT INTO <iceberg table> path these settings are silently (and safely) ignored — the Iceberg column_mapper is always authoritative there — so there is nothing to corrupt on that path, and dropping the guard would not turn this test red (it's red today with the guard present).

Two ways to make the test actually exercise the guard, if we keep it:

Freeze the setting into the engine: CREATE TABLE t (...) ENGINE = IcebergLocal(...) SETTINGS output_format_parquet_auto_assign_field_ids = 1 (then any INSERT trips it — needs a separate table from the plain-insert case). The DataLake settings collection embeds format settings (that's the M(CLASS_NAME, Map) line this PR adds to DataLakeStorageSettings.h).
Or write via the table function INSERT INTO FUNCTION icebergLocal(...) SETTINGS output_format_parquet_auto_assign_field_ids = 1 SELECT ... — table functions pass format_settings = std::nullopt, so they resolve from the query context at write time and the guard does fire.

I left the fix to you: whether to keep the guard at all is entangled with the manual-assign notation thread you're still deciding, and I couldn't run the integration suite in this environment to confirm which of the two rewrites is cleanest for the no_spark suite.

clickhouse-gh · 2026-07-02T19:24:45Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered: 144/176 (81.82%) · Uncovered code

Full report · Diff report

Onyx2406 added 3 commits April 4, 2026 18:39

Support specifying field_id for columns when writing Parquet files

12842a0

Add `output_format_parquet_column_field_ids` setting to specify Parquet field IDs per column (e.g., 'col_a: 10, col_b: 20'). This is needed for Apache Iceberg compatibility. Fixes ClickHouse#58753

Rename test 04077_parquet_column_field_ids to 04080 to avoid numb…

910b7e8

…er collision with upstream master

alexey-milovidov reviewed Apr 5, 2026

View reviewed changes

Comment thread src/Core/FormatFactorySettings.h Outdated

alexey-milovidov self-assigned this Apr 5, 2026

alexey-milovidov requested changes Apr 5, 2026

View reviewed changes

Onyx2406 added 2 commits April 18, 2026 14:43

Merge upstream master to pick up flaky-test fixes (ClickHouse#101239, C…

d66467f

…lickHouse#101994, ClickHouse#102158, ClickHouse#101952, ClickHouse#102148, ClickHouse#102008, ClickHouse#102010) # Conflicts: # src/Formats/FormatSettings.h

Onyx2406 changed the title ~~Support specifying field_id for columns when writing Parquet files~~ Support specifying or auto-assigning Parquet field_ids for output columns Apr 18, 2026

alexey-milovidov and others added 5 commits May 12, 2026 02:07

Merge remote-tracking branch 'origin/master' into feat/parquet-column…

dcc5bca

…-field-id # Conflicts: # src/Formats/FormatFactory.cpp # src/Formats/FormatSettings.h # src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp

alexey-milovidov added the can be tested Allows running workflows for external contributors label May 12, 2026

clickhouse-gh Bot added the pr-feature Pull request with new product feature label May 12, 2026

clickhouse-gh Bot reviewed May 17, 2026

View reviewed changes

Comment thread src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp Outdated

clickhouse-gh Bot reviewed May 17, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04080_parquet_column_field_ids.sql Outdated

alexey-milovidov and others added 7 commits May 18, 2026 21:30

Merge remote-tracking branch 'origin/master' into feat/parquet-column…

969b1b8

…-field-id

clickhouse-gh Bot added the manual approve Manual approve required to run CI label May 18, 2026

Merge remote-tracking branch 'origin/master' into feat/parquet-column…

933c5d4

…-field-id # Conflicts: # .github/workflows/pull_request.yml

groeneai mentioned this pull request Jun 8, 2026

Fix STID 1941-1bfa: stop mutating call-syntax arrayElement on kql_array_sort_* #106691

Closed

1 task

Merge remote-tracking branch 'origin/master' into feat/parquet-column…

aa15f16

…-field-id

groeneai mentioned this pull request Jun 9, 2026

Treat float NaN keys as NULL on JOIN ON #106540

Open

alexey-milovidov reviewed Jun 9, 2026

View reviewed changes

alexey-milovidov reviewed Jun 16, 2026

View reviewed changes

alexey-milovidov and others added 2 commits June 18, 2026 09:40

Merge remote-tracking branch 'origin/master' into feat/parquet-column…

8ef07d4

…-field-id

clickhouse-gh Bot reviewed Jun 22, 2026

View reviewed changes

alexey-milovidov and others added 3 commits June 26, 2026 00:52

Merge branch 'master' into feat/parquet-column-field-id

621a9fd

Merge branch 'master' into feat/parquet-column-field-id

9cd4798

clickhouse-gh Bot reviewed Jun 27, 2026

View reviewed changes

Comment thread src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp

Comment thread docs/en/interfaces/formats/Parquet/Parquet.md Outdated

alexey-milovidov and others added 2 commits June 30, 2026 05:40

Merge remote-tracking branch 'origin/master' into feat/parquet-column…

2a0840c

…-field-id

Merge remote-tracking branch 'origin/master' into pr-101783-fieldids

6322069

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

Onyx2406 commented Apr 4, 2026 • edited by alexey-milovidov Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

1. Explicit per-column overrides

2. Auto-assign (Iceberg writer convention)

3. Mixed mode

Implementation

Test

Changelog category (leave one):

Changelog entry:

Documentation entry for user-facing changes

Uh oh!

Uh oh!

alexey-milovidov left a comment

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov commented Apr 5, 2026

Uh oh!

clickhouse-gh Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Findings

Final Verdict

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented Jun 7, 2026

Uh oh!

alexey-milovidov commented Jun 9, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov left a comment

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov commented Jun 16, 2026

Uh oh!

alexey-milovidov commented Jun 16, 2026

Uh oh!

alexey-milovidov commented Jun 16, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented Jun 30, 2026

Uh oh!

alexey-milovidov commented Jun 30, 2026

Uh oh!

alexey-milovidov commented Jun 30, 2026

Uh oh!

alexey-milovidov commented Jul 2, 2026

Uh oh!

clickhouse-gh Bot commented Jul 2, 2026

LLVM Coverage Report

Uh oh!

Reviewers

Onyx2406 commented Apr 4, 2026 •

edited by alexey-milovidov

Loading

clickhouse-gh Bot commented May 12, 2026 •

edited

Loading