
amosbird · 2026-06-12T07:42:58Z

Each subfield of a named Tuple is stored as a separate stream named by the field path (e.g. data.c2s.statistics.heros_statistics.damage.bin), not by position. Adding a new subfield at any position only introduces new stream files; all preexisting subfield streams are name-stable and remain valid as-is.
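To make the name-stability argument concrete, here is a minimal sketch that models a named Tuple as a nested dict and derives one stream file name per leaf from its field path. The helper `leaf_stream_names` and the dict representation are hypothetical, and the sketch ignores file-name escaping and the extra streams that Array or Nullable wrappers would add.

```python
def leaf_stream_names(column, tuple_type):
    """Return sorted stream file names, one per leaf subfield of a named Tuple.

    `tuple_type` is a nested dict {field_name: type_name_or_nested_dict}.
    This is an illustrative model, not the server's naming code.
    """
    names = []

    def walk(prefix, node):
        if isinstance(node, dict):
            # Named Tuple: descend by field name, never by position.
            for field, child in node.items():
                walk(f"{prefix}.{field}", child)
        else:
            names.append(f"{prefix}.bin")

    walk(column, tuple_type)
    return sorted(names)


old_type = {"c2s": {"gold": "UInt64", "name": "String"}}
# A new subfield is inserted at the front of the nested Tuple.
new_type = {"c2s": {"level": "UInt8", "gold": "UInt64", "name": "String"}}

old_streams = leaf_stream_names("data", old_type)
new_streams = leaf_stream_names("data", new_type)
added = sorted(set(new_streams) - set(old_streams))
```

In this model every old stream name survives unchanged, and the only difference is the new `data.c2s.level.bin` entry, regardless of where the subfield was inserted.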

Today, ALTER MODIFY COLUMN <col> Tuple(..., new_field ...) on a named Tuple triggers a full data mutation that rewrites every part. On a 2M-row / 1.3 GiB Wide part with a deeply nested customer schema this took ~9 s; with this change it completes in ~12 ms without producing a Mutation.

This works because:

- Subcolumn reads of new subfields fall back to the type's default via the existing fillMissingColumns.
- Whole-tuple reads work via the existing CAST(named-Tuple → named-Tuple-superset), which matches by name and fills missing elements with defaults.
- Background merges materialize the new subfield streams for old parts transparently.
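The read-side behaviour can be illustrated with a small model of the name-matching cast. This is a sketch under assumptions: `cast_to_superset`, `default_for`, and the `TYPE_DEFAULTS` table are hypothetical names, rows are modelled as nested dicts, and only a few scalar types are covered.

```python
# Type-defaults for the handful of scalar types used in this sketch.
TYPE_DEFAULTS = {"UInt8": 0, "UInt64": 0, "Int64": 0, "String": ""}


def default_for(type_):
    """Type-default of a scalar, or a Tuple of defaults for a nested type."""
    if isinstance(type_, dict):
        return {name: default_for(sub) for name, sub in type_.items()}
    return TYPE_DEFAULTS[type_]


def cast_to_superset(value, target_type):
    """Match elements by name; fill elements missing from `value` with defaults."""
    result = {}
    for name, sub in target_type.items():
        if name not in value:
            result[name] = default_for(sub)          # subfield added by the ALTER
        elif isinstance(sub, dict):
            result[name] = cast_to_superset(value[name], sub)
        else:
            result[name] = value[name]               # preexisting data, untouched
    return result


old_row = {"gold": 7, "stats": {"damage": 3}}
new_type = {
    "level": "UInt8",
    "gold": "UInt64",
    "stats": {"damage": "UInt64", "heals": "UInt64"},
    "name": "String",
}
new_row = cast_to_superset(old_row, new_type)
```

Here `new_row` keeps `gold` and `stats.damage` from the old part and gets `0` / `''` for the three subfields that did not exist when the part was written.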

isMetadataOnlyConversion is extended to recognize named-Tuple subfield additions, recursing through Array / Nullable / Map / Tuple. Every old subfield must still be present by name with a metadata-only-compatible type; new subfields can be added at any position. Removing/renaming subfields or changing an existing subfield's type in a non-metadata-only way still requires a mutation. Map recursion is added in the same change.
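The rule described above can be sketched as a recursive check over a toy type tree. Everything here is an assumption made for illustration: types are modelled as strings (scalars) or tuples such as `("Array", inner)`, scalars are treated as compatible only when identical, and Map keys are required to be unchanged. The real `isMetadataOnlyConversion` accepts more scalar conversions than this model does.

```python
def is_metadata_only(old, new):
    """Toy model: True if `new` only adds named-Tuple subfields relative to `old`."""
    if isinstance(old, str) or isinstance(new, str):
        return old == new                      # assumption: scalars must be identical
    if old[0] != new[0]:
        return False                           # wrapper kind changed
    kind = old[0]
    if kind in ("Array", "Nullable"):
        return is_metadata_only(old[1], new[1])
    if kind == "Map":
        # assumption: the key type must not change; recurse into the value type
        return old[1] == new[1] and is_metadata_only(old[2], new[2])
    if kind == "Tuple":
        old_fields, new_fields = old[1], new[1]
        if any(name is None for name, _ in old_fields + new_fields):
            return old_fields == new_fields    # unnamed Tuples cannot be extended
        new_by_name = dict(new_fields)
        # Every old subfield must survive by name with a compatible type;
        # extra names in `new` are allowed at any position.
        return all(
            name in new_by_name and is_metadata_only(t, new_by_name[name])
            for name, t in old_fields
        )
    return False


frames_old = ("Array", ("Tuple", [("function_name", "String"), ("address", "UInt64")]))
frames_new = ("Array", ("Tuple", [("function_name", "String"),
                                  ("line", "UInt32"),
                                  ("address", "UInt64")]))
```

With these definitions `is_metadata_only(frames_old, frames_new)` holds, while the reverse direction (which removes `line`) does not.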

Test 04319_named_tuple_metadata_only_alter covers: append at end/middle, multiple sequential ALTERs, nested Tuple, Array(Tuple), Map(K, Tuple), Nested with flatten_nested=0, deeply nested Array(Tuple) inside Tuple(Tuple(...)), Nullable and non-Nullable new subfields, customer schema reproducer, merge materialization, plus reject cases (unnamed Tuple, subfield removal, rename, incompatible type change, stream-name collision).

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

ALTER TABLE ... MODIFY COLUMN <col> Tuple(...) on a named Tuple is now metadata-only when only adding subfields (in any position), matching the speed of top-level ADD COLUMN. The recursion also covers Array, Nullable, Map, and nested Tuple wrappers.

Documentation entry for user-facing changes

Documentation is not required (behavioral improvement; semantics unchanged)

clickhouse-gh · 2026-06-12T07:43:34Z

Run key, index, and projection safety checks for any `MODIFY_COLUMN` that changes the column's type, regardless of whether the change is metadata-only or requires a mutation. Previously these checks were gated behind `isRequireMutationStage`, so metadata-only conversions (e.g. adding a subfield to a named `Tuple` whose subcolumn appears in `ORDER BY`) bypassed them, leaving stale `primary.idx` / partition key bytes that no longer match the new tuple arity.

This was the blocking AI review finding on ClickHouse#107305.

Add regression tests covering:
- a subcolumn of the modified Tuple appearing in `ORDER BY`
- a subcolumn appearing in `PARTITION BY`
- the whole Tuple column appearing in `ORDER BY` (`primary.idx` arity would mismatch; caught by `isSafeForKeyConversion`)
- sanity: a Tuple not in any key is still metadata-only ALTER-able
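The change in gating can be pictured with a small model. This is only a sketch: `ModifyColumn`, `runs_key_checks_before`, `runs_key_checks_after`, and `touches_key` are hypothetical names, key expressions are reduced to dotted column paths, and what the real checks do once they find a match is not modelled.

```python
from dataclasses import dataclass


@dataclass
class ModifyColumn:
    column: str              # top-level column being modified
    type_changed: bool       # the ALTER changes the column's type
    requires_mutation: bool  # the ALTER rewrites data parts


def key_columns(key_exprs):
    """Top-level columns referenced by key expressions such as 'data.id'.

    Simplification: assumes top-level column names contain no dots.
    """
    return {expr.split(".", 1)[0] for expr in key_exprs}


def runs_key_checks_before(cmd):
    # Old gate: checks ran only when a mutation stage was required.
    return cmd.requires_mutation


def runs_key_checks_after(cmd):
    # New gate: checks run for every type-changing MODIFY COLUMN.
    return cmd.type_changed


def touches_key(cmd, key_exprs):
    return cmd.column in key_columns(key_exprs)


# Metadata-only subfield addition on `data`, whose subcolumn is in ORDER BY.
alter = ModifyColumn(column="data", type_changed=True, requires_mutation=False)
order_by = ["data.id", "ts"]
```

Under the old gate this ALTER skips the checks even though `touches_key(alter, order_by)` is true; under the new gate the checks run.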

…ERT and merge

When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305.

This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition.

Approach
- Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`).
- Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`.
- After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper.
- Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper.
- INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`.
- Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs.
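The INSERT and merge steps above can be sketched as follows. This is a toy model, not the server code: rows are nested dicts, `expired_subfields`, `narrow_type`, and `merged_expired` are hypothetical stand-ins for the roles played by `IColumn::hasOnlyTypeDefaults`, `narrowDataTypeByExpiredSubstreams`, and the union step in `MergeTask::prepare`, and the case where every subfield of a column is default is deliberately not modelled.

```python
TYPE_DEFAULTS = {"UInt64": 0, "String": ""}


def all_default(node_type, values):
    """True if every value in this subtree equals its type-default."""
    if isinstance(node_type, dict):
        return all(all_default(sub, [v[name] for v in values])
                   for name, sub in node_type.items())
    return all(v == TYPE_DEFAULTS[node_type] for v in values)


def expired_subfields(column, type_, rows):
    """Dotted paths of the largest all-default subtrees below `column`.

    Only subfields are considered; the top-level column itself is never expired.
    """
    expired = []

    def walk(prefix, node_type, values):
        for name, sub in node_type.items():
            path = f"{prefix}.{name}"
            child_values = [v[name] for v in values]
            if all_default(sub, child_values):
                expired.append(path)
            elif isinstance(sub, dict):
                walk(path, sub, child_values)

    walk(column, type_, rows)
    return expired


def narrow_type(column, type_, expired):
    """Drop expired subfields from the Tuple type recorded for the part."""
    def walk(prefix, node_type):
        kept = {}
        for name, sub in node_type.items():
            path = f"{prefix}.{name}"
            if path in expired:
                continue
            kept[name] = walk(path, sub) if isinstance(sub, dict) else sub
        return kept
    return walk(column, type_)


def merged_expired(full_leaves, source_parts_leaves):
    """Merge step: a leaf absent from every source part stays expired."""
    present = set().union(*source_parts_leaves)
    return sorted(set(full_leaves) - present)


full_type = {"c2s": {"gold": "UInt64", "name": "String"}, "note": "String"}
rows = [
    {"c2s": {"gold": 0, "name": "a"}, "note": ""},
    {"c2s": {"gold": 0, "name": "b"}, "note": ""},
]
expired = expired_subfields("data", full_type, rows)
narrowed = narrow_type("data", full_type, expired)
still_expired = merged_expired(
    ["data.c2s.gold", "data.c2s.name", "data.note"],
    [{"data.c2s.name"}, {"data.c2s.name", "data.note"}],
)
```

In this example the inserted part drops `data.c2s.gold` and `data.note`, and a merge of two parts where only one kept `data.note` leaves just `data.c2s.gold` expired, which is the monotonic behaviour described in the last bullet.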
Why top-level all-default columns are intentionally NOT pruned

If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column", equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts).

This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior.

Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine.

What is NOT touched
- Compact parts: early return preserved; pruning only fires for Wide parts.
- Patch parts: skipped (mirrors the existing whole-column behavior).
- Mutate path: not pruned; mutations preserve the existing schema.
- Top-level all-default columns: see note above.
- PR ClickHouse#98472's column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction).

Gate
- `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`.
The history entry is recorded under 26.6.

Compatibility
- No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305.

Tests
- `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison.

### Documentation entry for user-facing changes
- [x] Documentation is not required.

### Changelog category (leave one):
- Improvement

### Changelog entry:
Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).

EmeraldShift · 2026-06-26T13:41:40Z

+1 -- We just discovered this is very valuable for e.g. continuous profiling data, because we store the call stack like Array(Tuple(function_name String, address UInt64, ...)) and adding more metadata to each frame is very painful.


clickhouse-gh · 2026-06-27T12:43:22Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 92/149 (61.74%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 4 line(s) · Uncovered code


clickhouse-gh (bot) added the pr-improvement label ("Pull request with some product improvements") on Jun 12, 2026