Skip writing all-default columns during MergeTree INSERT#98472
@amosbird thanks for letting me know! Indeed, I will take a look.
Pull request overview
Adds an INSERT-time optimization for MergeTree to avoid writing data streams/files for columns that are entirely type-default within an inserted block, reducing disk usage for sparse-update patterns.
Changes:
- Introduces `skip_empty_columns_on_insert` MergeTree setting and applies it in `MergeTreeDataWriter::writeTempPartImpl` by filtering all-default columns from the part’s written column list.
- Adds `IColumn::hasOnlyDefaults()` (with implementations/overrides for several column types) to efficiently detect all-default columns.
- Adds a new stateless test and reference output covering several correctness scenarios around missing columns.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/Storages/MergeTree/MergeTreeDataWriter.cpp | Filters all-default columns before constructing MergedBlockOutputStream; toggles reset_columns when filtering occurred. |
| src/Storages/MergeTree/MergeTreeSettings.cpp | Documents new MergeTree setting skip_empty_columns_on_insert. |
| src/Core/SettingsChangesHistory.cpp | Records the new setting in settings change history. |
| src/Columns/IColumn.h / src/Columns/IColumn.cpp | Adds and implements (via IColumnHelper) the new hasOnlyDefaults() API. |
| src/Columns/ColumnVector.h | Adds a fast-path hasOnlyDefaults() using memoryIsZero. |
| src/Columns/ColumnConst.h / ColumnFixedString.h / ColumnDecimal.h | Adds hasOnlyDefaults() overrides for common fixed-size representations. |
| src/Columns/ColumnLazy.h/.cpp, ColumnUnique.h, ColumnCompressed.h, ColumnBLOB.h, ColumnFunction.h, ColumnAggregateFunction.h, IColumnDummy.h | Adds required hasOnlyDefaults() overrides/implementations (some throw / return conservative defaults). |
| tests/queries/0_stateless/04006_skip_empty_columns_on_insert.sql | New stateless test cases for the optimization. |
| tests/queries/0_stateless/04006_skip_empty_columns_on_insert.reference | Expected output for the new stateless test. |
    if (!empty_columns.empty())
    {
        auto filtered = columns.eraseNames(empty_columns);
        if (!filtered.empty())
        {
            columns = std::move(filtered);
            has_empty_columns = true;
            for (const auto & name : empty_columns)
                infos.erase(name);
        }
The filtering logic won’t skip any columns when all columns in the block are defaults: columns.eraseNames(empty_columns) will return an empty list, and the if (!filtered.empty()) guard prevents applying the filter at all. This contradicts the feature’s intent (skip all-default columns) and keeps writing unnecessary files for fully-default inserts. Consider keeping at least one “anchor” column (e.g., first physical column or a key column) and removing the rest, and update the test case to assert the reduced on-disk columns set.
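The "anchor column" idea can be sketched roughly as follows. This is only an illustration on plain standard containers, not the PR's code: `columnsToWrite` is a hypothetical helper, and picking the first physical column as the anchor is an assumption (a key column would be another option).

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not part of the PR: returns the physical columns that
// should still be written, given the set of all-default ("empty") columns.
// If every column is empty, the first one is kept as an anchor so that the
// part still has at least one column on disk.
std::vector<std::string> columnsToWrite(
    const std::vector<std::string> & physical_columns,
    const std::set<std::string> & empty_columns)
{
    std::vector<std::string> kept;
    for (const auto & name : physical_columns)
        if (empty_columns.count(name) == 0)
            kept.push_back(name);

    // Anchor rule: never return an empty list for a non-empty input.
    if (kept.empty() && !physical_columns.empty())
        kept.push_back(physical_columns.front());

    return kept;
}
```

With this shape, a fully-default insert still drops every column except the anchor, instead of falling back to writing all of them.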
    const auto & columns_description = metadata_snapshot->getColumns();
    NameSet empty_columns;
    for (const auto & col : block)
    {
        auto col_default = columns_description.getDefault(col.name);
        if (col_default && col_default->expression)
            continue;
        if (col.column->hasOnlyDefaults())
            empty_columns.insert(col.name);
    }
empty_columns is collected by iterating over every column in block. At this point columns is getAllPhysical().filter(block.getNames()), while block can also contain temporary/subcolumns added for sorting key / skip-index expressions. Those non-physical columns can be marked “empty” even though they will never be written, which can cause has_empty_columns to become true without actually removing any physical columns and trigger reset_columns unnecessarily. Consider iterating over columns (and looking up each in block) instead of iterating over block.
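A minimal sketch of the suggested iteration order, using toy types instead of `Block` and `NamesAndTypesList`. `ToyBlock` and `collectEmptyPhysicalColumns` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a block: column name -> "all values are type defaults".
using ToyBlock = std::map<std::string, bool>;

// Hypothetical helper: collects empty columns by walking the physical column
// list (what will actually be written) and looking each name up in the block,
// so temporary expression columns in the block are never considered.
std::set<std::string> collectEmptyPhysicalColumns(
    const std::vector<std::string> & physical_columns,
    const ToyBlock & block)
{
    std::set<std::string> empty_columns;
    for (const auto & name : physical_columns)
    {
        auto it = block.find(name);
        if (it != block.end() && it->second)
            empty_columns.insert(name);
    }
    return empty_columns;
}
```

Here an all-default temporary column such as a sorting-key expression stays out of `empty_columns`, so it cannot flip `has_empty_columns` on its own.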
    bool hasOnlyDefaults() const override
    {
        return memoryIsZero(data.data(), 0, data.size() * sizeof(T));
    }
ColumnVector<T>::hasOnlyDefaults() uses a raw memoryIsZero check, but isDefaultAt() uses data[n] == T{}. For floating-point types, -0.0 == 0.0 is true, so a column containing only -0.0 values is “all-default” per isDefaultAt() but will return false here (bytes are not all zero). This makes hasOnlyDefaults() inconsistent with the column’s own default semantics. Consider falling back to an isDefaultAt() loop for floating-point T (or otherwise aligning the definition).
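The mismatch is easy to reproduce outside ClickHouse. The sketch below assumes IEEE 754 floating point and uses plain `std::vector` in place of the column's storage; `allBytesZero`, `allValuesDefault` and `hasOnlyDefaultsSketch` are hypothetical names, not functions from the code base.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Byte-wise check, similar in spirit to a memoryIsZero-style fast path.
template <typename T>
bool allBytesZero(const std::vector<T> & data)
{
    const auto * bytes = reinterpret_cast<const unsigned char *>(data.data());
    for (std::size_t i = 0; i < data.size() * sizeof(T); ++i)
        if (bytes[i] != 0)
            return false;
    return true;
}

// Value-wise check, matching the `data[n] == T{}` semantics of isDefaultAt().
template <typename T>
bool allValuesDefault(const std::vector<T> & data)
{
    for (const T & value : data)
        if (!(value == T{}))
            return false;
    return true;
}

// One possible alignment: keep the byte-wise fast path for non-floating-point
// types and fall back to the value loop for floating-point types.
template <typename T>
bool hasOnlyDefaultsSketch(const std::vector<T> & data)
{
    if (std::is_floating_point<T>::value)
        return allValuesDefault(data);
    return allBytesZero(data);
}
```

For a column holding only `-0.0`, the byte-wise check reports "not all zero" because the sign bit is set, while the value-wise check treats every element as default.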
Fgrtue left a comment:
I would suggest adding custom `hasOnlyDefaults()` implementations for the following columns:
- ColumnString: if I understand correctly, we could just check that the offsets are all 0.
- ColumnNull: probably `memoryIsByte` could be used?
- ColumnArray: likely we could also check that the offsets are empty (equal to 0), as in the case of ColumnString.
- ColumnSparse: it seems that checking only the offsets would be enough.
- ColumnTuple: at the moment we call `isDefaultAt()` NxM times regardless of whether the columns inside the tuple have an optimized custom version of `hasOnlyDefaults()`. If we rewrite `hasOnlyDefaults()` to just propagate the call to the columns stored in the tuple, we might get some performance improvement.
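For the ColumnTuple point, the propagation could look roughly like this. It is a toy model with hypothetical `ToyColumn`, `ToyInt64Column` and `ToyTupleColumn` types, not the real `IColumn` hierarchy.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Minimal stand-in for the column interface.
struct ToyColumn
{
    virtual ~ToyColumn() = default;
    virtual bool hasOnlyDefaults() const = 0;
};

// Leaf column with its own (possibly optimized) check.
struct ToyInt64Column : ToyColumn
{
    std::vector<long long> data;

    explicit ToyInt64Column(std::vector<long long> data_) : data(std::move(data_)) {}

    bool hasOnlyDefaults() const override
    {
        for (long long value : data)
            if (value != 0)
                return false;
        return true;
    }
};

// Tuple column: delegates to each element once instead of calling a per-row
// isDefaultAt() for every (row, element) pair.
struct ToyTupleColumn : ToyColumn
{
    std::vector<std::unique_ptr<ToyColumn>> elements;

    bool hasOnlyDefaults() const override
    {
        for (const auto & element : elements)
            if (!element->hasOnlyDefaults())
                return false;
        return true;
    }
};
```

Each element then benefits from whatever fast path its own type provides, and the tuple check stops at the first non-default element.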
@Fgrtue It seems the CH Inc sync requires manual resolution.
@amosbird It should be done. I wanted to take a second quick look at the PR today and will update you on the results. Just to make sure: this is the final version so far, right?
Yes. (I mistakenly configured Copilot to force-push, which appears to have overridden the existing reviews. Sorry about that.)
@amosbird, did you have a chance to see my previous review suggestion about adding optimized |
|
Thanks for the tips! I've added optimized
Test cases 5, 8, 9, and 10 exercise |
      (data.supportsTransactions() && context->getCurrentTransaction()) ? context->getCurrentTransaction()->tid : Tx::PrehistoricTID,
      block.bytes(),
    - /*reset_columns=*/ false,
    + /*reset_columns=*/ has_empty_columns,
I am trying to understand whether we need to set reset_columns to true when we have empty columns.
So far I have found that reset_columns is used in three places:
- MergedBlockOutputStream.cpp:40
- MergedBlockOutputStream.cpp:221
- MergedBlockOutputStream.cpp:442
In two of them (lines 40 and 442) it seems that we won't get any new information in infos. I have not fully verified the third one (221) yet (I will), but at first glance reset_columns doesn't seem to affect that part either.
Could you please check whether it is needed, and why?
You are right: `reset_columns = true` is not needed here. I traced through all three sites:
1. Constructor (`IMergedBlockOutputStream.cpp:40-41`): initializes `new_serialization_infos = SerializationInfoByName(columns_list, info_settings)`. Note that `info_settings` has `choose_kind = false` (line 32), while we already computed the real serialization info with `choose_kind = true` at `MergeTreeDataWriter.cpp:875` and set it via `new_data_part->setColumns(columns, infos, ...)` at line 893.
2. `writeImpl` (`MergedBlockOutputStream.cpp:442-443`): `new_serialization_infos.add(block)` accumulates stats from the written block, but this is the same block we already ran `infos.add(block)` on at line 882, so it just recomputes equivalent statistics.
3. `finalizePartAsync` (`MergedBlockOutputStream.cpp:221-231`) does three things:
   - `serialization_infos.replaceData(new_serialization_infos)` replaces only the `data` member (not `kind_stack`) with equivalent stats from step 2.
   - `removeEmptyColumnsFromPart(new_part, part_columns, new_part->expired_columns, ...)` is a no-op because `expired_columns` is empty. It is only populated in the merge path (`MergeTask.cpp:610-626`) for TTL-expired columns, never during INSERT.
   - `new_part->setColumns(part_columns, serialization_infos, ...)` is redundant, since we already called `setColumns` with the correct filtered `columns` and `infos` at line 893.

So indeed the entire reset_columns block is a no-op in our INSERT path. I will change it to false.
Fgrtue left a comment:
The tests are good. I would suggest adding the following test cases:
- Compact parts (i.e. min_bytes_for_wide_part != 0, min_rows_for_wide_part = 0) for a) skipping a column and b) merging parts.
- Could we also add a check for a LowCardinality column, since this is a fairly common use case?
- Regarding the merge tests, what do you think about testing both types of merges, vertical and horizontal?
    bool ColumnSparse::hasOnlyTypeDefaults() const
    {
        return _size == 0 || getOffsetsData().empty();
I am thinking of a case where a sparse column consists only of a single non-type-default value (for example 5). It seems to me that we will not distinguish a type-default sparse column from a sparse column filled with just one (any) value.
Moreover, the generic version IColumnHelper<Derived, Parent>::hasOnlyTypeDefaults() seems to give a wrong result as well.
Even though this currently does not lead to data corruption (i.e. returning type defaults instead of the DEFAULT elements at values[0]), it seems to be a wrong implementation of this function.
If my reasoning is right, we could fix this by checking that the element at index 0 of values is itself a type default, i.e. values->isDefaultAt(0).
Good catch! Fixed: added && values->isDefaultAt(0) so we verify the actual stored default value, not just the absence of offsets. The generic IColumnHelper::hasOnlyTypeDefaults() fallback also gives wrong results for ColumnSparse (since isDefaultAt(n) just checks getValueIndex(n) == 0), but the specialized override now handles this correctly.
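To make the corner case concrete, here is a toy model of the check after the fix. It assumes the layout described in this thread, where `values[0]` is the shared default slot and `offsets` lists the rows holding other values. `ToySparseColumn` is hypothetical, and the exact grouping of the conditions in the real override may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a sparse column of integers.
struct ToySparseColumn
{
    std::vector<long long> values;     // values[0] is the shared "default" slot
    std::vector<std::size_t> offsets;  // rows that hold a non-default value
    std::size_t size = 0;              // total number of rows

    bool hasOnlyTypeDefaults() const
    {
        if (size == 0)
            return true;
        assert(!values.empty());
        // No offsets means every row reads values[0]; that slot must itself
        // be the type default (0 here), otherwise the column is not "empty".
        return offsets.empty() && values[0] == 0;
    }
};
```

A column of three rows with no offsets and `values = {5}` reads as `5, 5, 5`, so it must not be reported as all-default.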
    {
        addSettingsChanges(merge_tree_settings_changes_history, "26.4",
        {
            {"skip_empty_columns_on_insert", false, false, "New setting to skip writing all-default columns on INSERT"},
It would probably be more accurate to say:
    - {"skip_empty_columns_on_insert", false, false, "New setting to skip writing all-default columns on INSERT"},
    + {"skip_empty_columns_on_insert", false, false, "New setting to skip writing all type default columns on INSERT"},
    ORDER BY column;

    SELECT 'case7_data';
    SELECT key, a, b FROM t_skip_empty_default_expr ORDER BY key;
Should we SELECT key, a, b, c here? Or is this intentional?
Added c to the SELECT. The reference now shows 1 5 0 50, confirming the MATERIALIZED expression a * 10 is correctly evaluated.
|
Added 4 new test cases addressing the review:
|
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
|
The |
CurtizJ left a comment:
This introduces an inconsistency: if a user inserts a column with all-zero values and then changes its default expression, the values returned on read will change as well. The same inconsistency already exists for columns added via ADD COLUMN whose default expression is later modified, so this is not a new problem, but it may be worth avoiding in this case.
Maybe we can store a marker in serialization_infos.json that records whether the column was physically written, so the reader can fill in defaults correctly?
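The inconsistency can be illustrated with a small toy model. It assumes, as described above, that a column missing from a part is filled on read from the table's current default. `ToyPartColumn` and `readColumn` are hypothetical and unrelated to the real reader code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of one column inside one data part.
struct ToyPartColumn
{
    std::size_t rows = 0;
    bool physically_written = false;  // false: the column was skipped on INSERT
    std::vector<long long> stored;    // only meaningful when physically_written
};

// A missing column is filled from the table's *current* default value,
// so its contents depend on when the read happens.
std::vector<long long> readColumn(const ToyPartColumn & column, long long current_default)
{
    if (column.physically_written)
        return column.stored;
    return std::vector<long long>(column.rows, current_default);
}
```

A part that stored its zeros keeps returning zeros after the default changes, while a part that skipped the column starts returning the new default. A per-part marker, as suggested above, would let the reader tell these two cases apart.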
…ERT and merge

When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305.

This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition.

Approach
- Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`).
- Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`.
- After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper.
- Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper.
- INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`.
- Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs.

Why top-level all-default columns are intentionally NOT pruned

If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column", equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts).

This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior.

Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine.

What is NOT touched
- Compact parts: early return preserved; pruning only fires for Wide parts.
- Patch parts: skipped (mirrors the existing whole-column behavior).
- Mutate path: not pruned; mutations preserve the existing schema.
- Top-level all-default columns: see note above.
- `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction).

Gate
- `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. The history entry is recorded under 26.6.

Compatibility
- No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305.

Tests
- `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison.

### Documentation entry for user-facing changes
- [x] Documentation is not required.

### Changelog category (leave one):
- Improvement

### Changelog entry:
Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
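The merge rule from the commit message ("Sub-case A") can be sketched on plain sets. This is only an illustration of the stated rule under the assumption that leaves are identified by their dotted paths. `expiredLeavesForMerge` is a hypothetical helper, not the code in `MergeTask::prepare`.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: given every leaf substream of the table schema and the
// leaves present in each source part, returns the leaves that are absent from
// every source part and can therefore stay pruned in the merged part.
std::set<std::string> expiredLeavesForMerge(
    const std::set<std::string> & schema_leaves,
    const std::vector<std::set<std::string>> & source_part_leaves)
{
    // Union of the leaves that at least one source part has on disk.
    std::set<std::string> present;
    for (const auto & leaves : source_part_leaves)
        present.insert(leaves.begin(), leaves.end());

    std::set<std::string> expired;
    for (const auto & leaf : schema_leaves)
        if (present.count(leaf) == 0)
            expired.insert(leaf);
    return expired;
}
```

A leaf that is present in even one source part is kept, which matches the "one pruned" merge variant listed in the tests.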
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column" — equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts). This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior. Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine. What is NOT touched - Compact parts: early return preserved; pruning only fires for Wide parts. - Patch parts: skipped (mirrors the existing whole-column behavior). - Mutate path: not pruned; mutations preserve the existing schema. - Top-level all-default columns: see note above. - `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction). Gate - `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. 
The history entry is recorded under 26.6. Compatibility - No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305. Tests - `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison. ### Documentation entry for user-facing changes - [x] Documentation is not required. ### Changelog category (leave one): - Improvement ### Changelog entry: Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column" — equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts). This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior. Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine. What is NOT touched - Compact parts: early return preserved; pruning only fires for Wide parts. - Patch parts: skipped (mirrors the existing whole-column behavior). - Mutate path: not pruned; mutations preserve the existing schema. - Top-level all-default columns: see note above. - `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction). Gate - `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. 
The history entry is recorded under 26.6. Compatibility - No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305. Tests - `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison. ### Documentation entry for user-facing changes - [x] Documentation is not required. ### Changelog category (leave one): - Improvement ### Changelog entry: Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column" — equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts). This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior. Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine. What is NOT touched - Compact parts: early return preserved; pruning only fires for Wide parts. - Patch parts: skipped (mirrors the existing whole-column behavior). - Mutate path: not pruned; mutations preserve the existing schema. - Top-level all-default columns: see note above. - `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction). Gate - `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. 
The history entry is recorded under 26.6. Compatibility - No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305. Tests - `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison. ### Documentation entry for user-facing changes - [x] Documentation is not required. ### Changelog category (leave one): - Improvement ### Changelog entry: Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
…ERT and merge

When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's `columns.txt` Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305.

This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition.

Approach
- Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`).
- Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`.
- After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper.
- Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper.
- INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`.
- Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs.
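As a rough illustration of the INSERT and Merge bullets above, the following standalone C++17 sketch models a named-Tuple column as a map from dotted leaf path to values. It is not the ClickHouse code: `LeafColumns`, `findAllDefaultLeaves`, and `expiredAfterMerge` are hypothetical names, only Int64 leaves are modelled, and 0 stands in for the type default.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model: dotted leaf path -> values of that leaf in one block.
// Only Int64 leaves are modelled, so 0 plays the role of the type default.
using LeafColumns = std::map<std::string, std::vector<std::int64_t>>;

// INSERT side: a leaf is a pruning candidate when every value in the block
// equals the type default. The returned dotted paths correspond to what the
// description calls contributions to `expired_columns`.
std::set<std::string> findAllDefaultLeaves(const LeafColumns & block)
{
    std::set<std::string> expired;
    for (const auto & [path, values] : block)
    {
        const bool only_defaults = std::all_of(
            values.begin(), values.end(), [](std::int64_t v) { return v == 0; });
        if (only_defaults)
            expired.insert(path);
    }
    return expired;
}

// Merge side (Sub-case A): take the union of leaves present in the source
// parts; any schema leaf absent from every source is expired in the result.
std::set<std::string> expiredAfterMerge(
    const std::set<std::string> & full_schema_leaves,
    const std::vector<std::set<std::string>> & source_part_leaves)
{
    assert(!source_part_leaves.empty()); // a merge has at least one source part

    std::set<std::string> present;
    for (const auto & leaves : source_part_leaves)
        present.insert(leaves.begin(), leaves.end());

    std::set<std::string> expired;
    for (const auto & leaf : full_schema_leaves)
        if (present.count(leaf) == 0)
            expired.insert(leaf);
    return expired;
}
```

With schema leaves `data.a`, `data.b`, `data.c` and source parts holding `{data.a}` and `{data.a, data.b}`, the sketch reports only `data.c` as expired; `data.b` is kept because one source still carries it, which is the monotonic property described in the Merge bullet.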
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column" — equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts). This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior. Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine. What is NOT touched - Compact parts: early return preserved; pruning only fires for Wide parts. - Patch parts: skipped (mirrors the existing whole-column behavior). - Mutate path: not pruned; mutations preserve the existing schema. - Top-level all-default columns: see note above. - `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction). Gate - `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. 
The history entry is recorded under 26.6. Compatibility - No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305. Tests - `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison. ### Documentation entry for user-facing changes - [x] Documentation is not required. ### Changelog category (leave one): - Improvement ### Changelog entry: Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column" — equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts). This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior. Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine. What is NOT touched - Compact parts: early return preserved; pruning only fires for Wide parts. - Patch parts: skipped (mirrors the existing whole-column behavior). - Mutate path: not pruned; mutations preserve the existing schema. - Top-level all-default columns: see note above. - `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction). Gate - `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. 
The history entry is recorded under 26.6. Compatibility - No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305. Tests - `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison. ### Documentation entry for user-facing changes - [x] Documentation is not required. ### Changelog category (leave one): - Improvement ### Changelog entry: Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column" — equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts). This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior. Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine. What is NOT touched - Compact parts: early return preserved; pruning only fires for Wide parts. - Patch parts: skipped (mirrors the existing whole-column behavior). - Mutate path: not pruned; mutations preserve the existing schema. - Top-level all-default columns: see note above. - `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction). Gate - `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. 
The history entry is recorded under 26.6. Compatibility - No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305. Tests - `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison. ### Documentation entry for user-facing changes - [x] Documentation is not required. ### Changelog category (leave one): - Improvement ### Changelog entry: Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column" — equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts). This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior. Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine. What is NOT touched - Compact parts: early return preserved; pruning only fires for Wide parts. - Patch parts: skipped (mirrors the existing whole-column behavior). - Mutate path: not pruned; mutations preserve the existing schema. - Top-level all-default columns: see note above. - `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction). Gate - `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. 
The history entry is recorded under 26.6. Compatibility - No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305. Tests - `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison. ### Documentation entry for user-facing changes - [x] Documentation is not required. ### Changelog category (leave one): - Improvement ### Changelog entry: Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column" — equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts). This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior. Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine. What is NOT touched - Compact parts: early return preserved; pruning only fires for Wide parts. - Patch parts: skipped (mirrors the existing whole-column behavior). - Mutate path: not pruned; mutations preserve the existing schema. - Top-level all-default columns: see note above. - `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction). Gate - `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`. 
The history entry is recorded under 26.6. Compatibility - No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305. Tests - `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison. ### Documentation entry for user-facing changes - [x] Documentation is not required. ### Changelog category (leave one): - Improvement ### Changelog entry: Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
…ERT and merge When all values of a named-Tuple subfield in a part are type-defaults, the writer omits that subfield's stream files and narrows the part's columns.txt Tuple type so it no longer mentions the subfield. Reads see the narrowed Tuple type and use `CAST(narrowed_tuple, full_tuple)` to materialize defaults, relying on the metadata-only ALTER work in ClickHouse#107305. This optimization is most useful for `PARTITION BY` schemes where different partitions populate different subsets of a wide schema's subfields: the on-disk part keeps only the substreams whose subfield actually appears in that partition. Approach - Reuse the existing whole-column pruning path (`IMergedBlockOutputStream::removeEmptyColumnsFromPart` consuming `new_data_part->expired_columns`). - Extend that path to accept dotted subfield names (`data.c2s.gold`) and narrow the column's Tuple type via the new `narrowDataTypeByExpiredSubstreams` helper in `DataTypes/Utils`. - After the prune pass, keep `columns_substreams.txt` consistent with the on-disk files via the new `ColumnsSubstreams::removeSubstreams` helper. - Preserve each kept subfield's `SerializationInfo` (its sparse / default kind and per-element `num_rows` / `num_defaults`) when narrowing the enclosing Tuple, via the new `narrowSerializationInfo` helper. - INSERT: `MergeTreeDataWriter` traverses each named-Tuple column with the new `IColumn::hasOnlyTypeDefaults` to spot all-default subtrees and contributes their dotted paths to `expired_columns`. - Merge (Sub-case A): `MergeTask::prepare` computes the union of leaf substreams across all source parts and marks any leaf absent from every source as expired in the merged part. This is monotonic: a merged part never re-materializes default values for a subfield that was consistently pruned in the inputs. 
Why top-level all-default columns are intentionally NOT pruned

If we erased a top-level Tuple column whose value is entirely default, the part would semantically lose that column ("missing column", equivalent to a column that was added by a later `ALTER ADD COLUMN`). A subsequent `ALTER MODIFY COLUMN ... DEFAULT <new_expr>` would then re-materialize the column with the NEW default expression on read, retroactively changing historical data. That is exactly the quirk tracked by ClickHouse#92475 (`ALTER MODIFY ... DEFAULT` rewriting old parts).

This PR sidesteps the problem by leaving top-level columns alone: subfield pruning only narrows the Tuple type of a column that still exists. The materialized 0 / '' / `[]` bytes of the kept columns pin the part's semantics; future `ALTER MODIFY ... DEFAULT` changes apply only to parts written after the ALTER, matching today's whole-column behavior.

Named-Tuple subfields have no per-subfield DEFAULT expression syntax (`Tuple(a Int64 DEFAULT 5)` is not a valid type), so pruning a subfield can only ever fall back to the language's type-default (0 / '' / NULL). This is also why the optimization composes cleanly with the per-column DEFAULT RFC in ClickHouse#92475 (comment 4334850399): subfield pruning operates entirely below the column boundary the RFC will redefine.

What is NOT touched
- Compact parts: early return preserved; pruning only fires for Wide parts.
- Patch parts: skipped (mirrors the existing whole-column behavior).
- Mutate path: not pruned; mutations preserve the existing schema.
- Top-level all-default columns: see note above.
- `PR ClickHouse#98472`'s column-level `skip_empty_columns_on_insert` mechanism: only the `hasOnlyTypeDefaults` column primitives are lifted, none of its signalling layer (no `WITH_SKIPPED_COLUMNS` serialization version, no JSON `skipped_columns` field, no DEFAULT-expression interaction).

Gate
- `enable_tuple_subfield_pruning` (default true) gates the entire feature in `MergeTreeSettings`.
The history entry is recorded under 26.6.

Compatibility
- No on-disk format change: parts written by this PR are readable by any server that has the metadata-only-ALTER work in ClickHouse#107305.

Tests
- `tests/queries/0_stateless/04320_tuple_subfield_pruning.sql` exercises 36 cases: flat / nested Tuple, Nullable wrap, Array(Tuple) (all-empty and non-empty), Map(K, Tuple), `LowCardinality(String)`, deep customer-like schema, `PARTITION BY` per-partition narrowing, setting OFF, Compact-part preservation, two-part merge variants (both pruned, one pruned, different subfields pruned), `INSERT SELECT` / async INSERT / materialized view, `ReplacingMergeTree` merge, vertical merge, `LWD`, `ALTER MODIFY ADD subfield + INSERT`, `ALTER UPDATE` mutation on narrowed part, multi-granule part, `DETACH / ATTACH PARTITION`, top-level column with a dot in its name, force-sparse + pruning interaction, subcolumn reads of pruned subfields, `CHECK TABLE` on a pruned part, and `bytes_on_disk` comparison.

### Documentation entry for user-facing changes
- [x] Documentation is not required.

### Changelog category (leave one):
- Improvement

### Changelog entry:
Automatically prune named-Tuple subfields whose values in a part are entirely type-defaults: the writer omits their stream files and records a narrowed Tuple type in `columns.txt`; reads materialize defaults via `CAST`. Gated by the new MergeTree-level setting `enable_tuple_subfield_pruning` (default on).
… parent DB is missing

`DataLakeConfiguration::getCatalog` (introduced by ClickHouse#100334) looked up the parent database in `DatabaseCatalog` and threw `LOGICAL_ERROR` ("Database X not found") when `tryGetDatabase` returned `nullptr`. That assertion is wrong: a missing database here is a transient runtime state, not a logical-invariant violation. Concretely it can fire during async metadata loading after a server restart (`AsyncLoader::worker` -> `DatabaseOrdinary::loadTableFromMetadata` -> `createStorageObjectStorage` -> `getCatalog`) when an unrelated table-load job in the same database has just thrown (for instance because of `cannot_allocate_thread_fault_injection_probability`) and the database has been detached as a result.

Stress tests with thread-allocation fault injection have been hitting this LOGICAL_ERROR sporadically: `STID 2377-2a78`, 3 distinct unrelated PRs over 90 days (PR ClickHouse#98472 on 2026-04-09, PR ClickHouse#100958 on 2026-04-12, PR ClickHouse#102804 on 2026-04-30 - none of which touch this code or its callers).

Production stack from PR ClickHouse#102804 stress-test (amd_debug):

```
2026.04.30 05:17:39.895829 [ 6955 ] AsyncLoader::worker: Code: 439. DB::Exception: Cannot schedule a task: fault injected (...): Cannot attach table `test_1`.`test_max_size_drop` from metadata file ...
2026.04.30 05:17:40.099425 [ 6998 ] {} <Fatal> : Logical error: 'Database test_1 not found'. [stack: DataLakeConfiguration::getCatalog -> createStorageObjectStorage -> registerStorageIceberg -> StorageFactory::get -> createTableFromAST -> DatabaseOrdinary::loadTableFromMetadata -> AsyncLoader::worker]
```

Fix: combine the two null-checks. `dynamic_pointer_cast` already returns `nullptr` for a null input, so the function naturally returns `nullptr` both for "DB not registered" and "DB is not `DataLakeCatalog`" - the same response either way.
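The property the fix relies on is standard-library behaviour and can be checked in isolation. The sketch below uses stand-in types (`IDatabase`, `DatabaseAtomic`, `DatabaseDataLakeCatalog`, and the helper `asCatalog` are simplified placeholders written for this illustration, not the real ClickHouse classes); a null pointer models the "database not registered" case.

```cpp
#include <cassert>
#include <memory>

// Stand-in hierarchy for illustration only.
struct IDatabase { virtual ~IDatabase() = default; };
struct DatabaseAtomic : IDatabase {};
struct DatabaseDataLakeCatalog : IDatabase {};

// One cast covers both cases: a null input and a database of another
// dynamic type both yield a null result, so no separate null-check is needed.
std::shared_ptr<DatabaseDataLakeCatalog> asCatalog(const std::shared_ptr<IDatabase> & db)
{
    return std::dynamic_pointer_cast<DatabaseDataLakeCatalog>(db);
}
```

With this shape the caller only has to handle a single "no catalog" outcome, which is what the commit message describes.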
This matches the behaviour of `getCatalog` before ClickHouse#100334, restores backward compatibility for `Iceberg` engine tables hosted in regular `Atomic`/`Ordinary` databases, and removes the spurious LOGICAL_ERROR signal from stress-test reports without changing behaviour for the supported `DataLakeCatalog` -> `Iceberg` path.

Local verification (debug build):
- Compiles, server starts.
- `CREATE TABLE iceberg_t ENGINE = IcebergLocal(...)` inside a regular `Atomic` database succeeds, DETACH/ATTACH database cycle succeeds, server restart with `async_load_databases=1` reloads the table without LOGICAL_ERROR.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=102804&sha=40e4eba7d14b8588106464e81b911e8de7a45dc6&name_0=PR&name_1=Stress%20test%20%28amd_debug%29

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| continue; | ||
| /// Column is materialized by this mutation (present in updated_header), | ||
| /// so it is written in full and is no longer skipped. | ||
| if (updated_header.has(new_name)) |
A skipped column still represents real inserted values, so preserving its marker through every mutation that leaves it out of updated_header is unsafe for type-changing ALTER MODIFY COLUMN.
Concrete trace: insert b UInt64 = 0 so b is skipped, then run ALTER TABLE ... MODIFY COLUMN b Nullable(UInt64). splitAndModifyMutationCommands skips the READ_COLUMN command because the source part has no physical b file, so updated_header does not contain b here and the marker is preserved. Later reads synthesize the current type default, NULL, but a normal type mutation of the stored value 0 should produce 0 as Nullable(UInt64).
Please either materialize skipped columns for type-changing READ_COLUMN mutations, or only preserve the marker when the old type-default converted to the new type is provably equal to the new type default. A regression with UInt64 -> Nullable(UInt64) would catch this.
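The second option in the request above (keep the marker only when the converted old default equals the new type default) can be written as a simple predicate. This is a sketch under assumptions: `Nullable(UInt64)` is modelled with `std::optional`, and `convertToNullable`, `nullableTypeDefault`, and `canKeepSkippedMarker` are hypothetical names, not functions from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Model of Nullable(UInt64): an empty optional stands for NULL.
using NullableUInt64 = std::optional<std::uint64_t>;

// Converting a stored UInt64 to Nullable(UInt64) keeps the value; it never yields NULL.
NullableUInt64 convertToNullable(std::uint64_t value) { return value; }

// The type default of Nullable(UInt64) is NULL.
NullableUInt64 nullableTypeDefault() { return std::nullopt; }

// The skipped marker is safe to keep only if
// convert(old type default) == new type default.
bool canKeepSkippedMarker(const NullableUInt64 & converted_old_default, const NullableUInt64 & new_type_default)
{
    return converted_old_default == new_type_default;
}
```

For the `UInt64 -> Nullable(UInt64)` trace the predicate is false, because the converted value is 0 while the new type default is NULL, so the column would have to be materialized in that case.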
The on-the-fly window is now guarded, but the fully-materialized type change can't be fixed by dropping the marker: both the marker and normal missing-column handling yield the new type's default (e.g. NULL), while the correct value is convert(0, new_type) = 0. Only materializing the skipped column produces it. Root cause: the type-changing READ_COLUMN is filtered out at MutateTask.cpp:559 for physically-absent (skipped) columns, so it never reaches getColumnsForNewDataPart. Left as a design decision (force materialization vs. record skip-time type/value) — see #98472 (comment)
| /// unrelated mutation silently dropped the marker. | ||
| { | ||
| NameSet new_skipped_columns; | ||
| for (const auto & name : serialization_infos.getSkippedColumns()) |
CLEAR COLUMN needs to forget a skipped column too. Today a skipped column is absent from part_columns, so the DROP_COLUMN command with clear = true is ignored before it reaches the interpreter, and this loop then preserves the old skipped marker because updated_header does not contain the column.
Concrete trace: insert b = 0 so b is skipped, then ALTER TABLE ... MODIFY COLUMN b UInt64 DEFAULT 999, then ALTER TABLE ... CLEAR COLUMN b. CLEAR COLUMN should remove the stored value and make the row read as the current default 999, but preserving the marker keeps returning the inserted type-default 0.
Please treat DROP_COLUMN with clear = true as affecting skipped columns even when they have no physical files: either drop the marker so the current default is evaluated, or materialize the cleared value explicitly.
On-the-fly CLEAR COLUMN is now handled on read via isColumnDropped (66cd944). The fully-materialized case still needs the clear DROP_COLUMN to reach getColumnsForNewDataPart (it is filtered at MutateTask.cpp:566 for skipped columns), which is the same materialization decision as the type-change blocker — see #98472 (comment)
Pushed
These guards are scoped to the Still open — needs a design decision (Blocker 1, and the fully-materialized case of Blocker 4): The root cause is in For Blocker 1 this cannot be fixed by dropping the marker: after skipped I could not run a full local build (this checkout's |
| continue; | ||
| /// Column is materialized by this mutation (present in updated_header), | ||
| /// so it is written in full and is no longer skipped. | ||
| if (updated_header.has(new_name)) |
A skipped column still represents real inserted values, so MATERIALIZE COLUMN must not treat the missing physical file as "value was never stored". Today splitAndModifyMutationCommands admits MATERIALIZE_COLUMN when the part lacks column files; for a skipped b this makes MutationsInterpreter compute the current DEFAULT expression, updated_header.has(b) is true here, and this block drops the skipped marker. A concrete trace is: insert b UInt64 = 0 with skipping enabled, ALTER MODIFY COLUMN b UInt64 DEFAULT 999, then ALTER TABLE ... MATERIALIZE COLUMN b. Before materialization reads return the inserted 0, but the mutation writes 999, violating the existing MATERIALIZE COLUMN contract that past values for DEFAULT columns are not overwritten. Please either keep skipped columns out of this materialization path or materialize the value read through the skipped marker, and add a regression for this sequence.
| /// ... DEFAULT 999, the newly added `b` must read 999, not the | ||
| /// frozen default. Fall through to normal missing-column handling | ||
| /// (which evaluates the DEFAULT expression) in that case. | ||
| if (alter_conversions->isColumnDropped(name_in_part)) |
This drop guard checks the old physical name after the rename mapping, but AlterConversions records DROP COLUMN under the current name. For a part with missing_columns = ['b'], pending RENAME COLUMN b TO c, then DROP COLUMN c, name_in_part becomes b, isColumnDropped("b") is false, and the stale marker is trusted. After ADD COLUMN c UInt64 DEFAULT 999, the new c can read as the old inserted type-default 0. MergeTask has the same ordering before it translates the marker to the current name. Please check the dropped state both before and after rename, or normalize missing-marker names through the full rename/drop chain before preserving or trusting them; add a rename -> drop -> add regression.
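One way to read the "normalize through the full rename/drop chain" suggestion is sketched below. It is an assumption-laden model: `PendingAlter` and `resolveMarkerName` are hypothetical, and pending commands are represented as a simple ordered list rather than the real `AlterConversions` structure.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a pending, not yet materialized ALTER command.
struct PendingAlter
{
    enum class Kind { Rename, Drop } kind;
    std::string name;      // column the command applies to (its name at that point)
    std::string rename_to; // used only for Rename
};

// Follows a marker recorded under `name_in_part` through the ordered commands.
// Returns the column's current name, or std::nullopt if it was dropped at any
// step, in which case the stale marker must not be trusted.
std::optional<std::string> resolveMarkerName(std::string name_in_part, const std::vector<PendingAlter> & alters)
{
    std::string current = std::move(name_in_part);
    for (const auto & alter : alters)
    {
        if (alter.name != current)
            continue;
        if (alter.kind == PendingAlter::Kind::Drop)
            return std::nullopt;
        current = alter.rename_to;
    }
    return current;
}
```

In the `RENAME COLUMN b TO c` then `DROP COLUMN c` trace, the drop is matched against the renamed name `c`, so the marker written for `b` is discarded before a later `ADD COLUMN c` can pick it up.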
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 275/321 (85.67%)
3e5529b to
0aa7d4a
Compare
When skip_empty_columns_on_insert is enabled and serialization_info_version is set to 'with_missing_columns', columns whose values are entirely type-defaults (zeros, empty strings, NULLs) are omitted from MergeTree parts at INSERT time. A structured 'missing_columns' marker in serialization.json records the frozen default for each omitted column, so a later ALTER MODIFY COLUMN ... DEFAULT does not retroactively change the inserted values.

Key components:
- IColumn::hasOnlyTypeDefaults() with optimized overrides (memoryIsZero, offsets check, null map check, etc.)
- skipEmptyColumnsOnInsert() in MergeTreeDataWriter filters columns using IDataType::getDefault() to match the read-path reconstruction
- SerializationInfoByName::MissingColumnInfo struct with TypeDefault and Expression (reserved for Phase 2) variants
- Read path (fillMissingColumns) consults the marker and fills type-default instead of evaluating the current DEFAULT expression
- Marker propagation through merges, mutations, and ALTER RENAME COLUMN
- Version gate: WITH_MISSING_COLUMNS = 2 in serialization_info_version

Fixes:
- Date32 type-default mismatch (getDefault() vs insertDefaultInto())
- Compact-part rename tracking for missing columns
- Expression markers rejected on read (fail closed until Phase 2)

Tests:
- 83 stateless test labels across 3 files (types, mutations, lifecycle)
- 6 integration tests (replication, mixed-version, restart, partitions)
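The all-default checks named under "Key components" can be pictured with plain containers. The functions below are an illustrative sketch, not the `IColumn` overrides themselves: columns are modelled as `std::vector`s, array offsets are assumed to be cumulative end positions, and every function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-width numeric column: all-default means every value is zero
// (a simple stand-in for a memory-is-zero fast path).
bool numericHasOnlyDefaults(const std::vector<std::uint64_t> & values)
{
    return std::all_of(values.begin(), values.end(), [](std::uint64_t v) { return v == 0; });
}

// Array-like column stored as cumulative end offsets: offsets never decrease,
// so all rows are empty exactly when the last offset is zero.
bool arrayHasOnlyDefaults(const std::vector<std::size_t> & offsets)
{
    return offsets.empty() || offsets.back() == 0;
}

// Nullable column: all-default means every null-map byte marks the row as NULL.
bool nullableHasOnlyDefaults(const std::vector<std::uint8_t> & null_map)
{
    return std::all_of(null_map.begin(), null_map.end(), [](std::uint8_t is_null) { return is_null == 1; });
}
```

Each check inspects only the cheap structural part of the column (raw values, offsets, or the null map), which is presumably why per-type overrides are preferred over a generic row-by-row comparison.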
0aa7d4a to
fc4513a
Compare

Summary
During MergeTree INSERT, columns whose values are entirely type-defaults (e.g., all zeros for `UInt64`, all empty strings for `String`, all `NULL`s for `Nullable`) are detected and excluded from the part's column list before constructing `MergedBlockOutputStream`. This avoids writing unnecessary `.bin` files (Wide parts) or data streams (Compact parts), saving disk space for sparse-update workloads where most columns in each INSERT are left at their type's default. The optimization is opt-in via the MergeTree setting `skip_empty_columns_on_insert` (off by default). It additionally requires `serialization_info_version` to be set to `with_missing_columns` (the format version that records frozen defaults for missing columns), so that a cluster pinned to a lower version for compatibility never writes parts that older servers cannot read.

The block itself is passed intact to the writer, so skip indices, projections, primary index, and min-max index are all computed from the full data. Reading a part that lacks a column fills it with type-defaults automatically, the same mechanism used by `ALTER TABLE ADD COLUMN` on existing parts.

To keep reads stable, a structured `missing_columns` array is recorded in the part's `serialization.json` (a new `WITH_MISSING_COLUMNS` serialization-info version). Each entry carries the column name and a `type_default` marker. On read, `fillMissingColumns` consults this marker and fills a missing column with its type-default even if the column later gains a new `DEFAULT` expression, so that a subsequent `ALTER MODIFY COLUMN ... DEFAULT` does not retroactively change the values that were actually inserted. The marker is propagated through merges, mutations, and on-the-fly `ALTER RENAME COLUMN`, so the inserted type-defaults survive part-lifecycle operations.

The `MissingColumnInfo` struct also reserves a `DefaultKind::Expression` variant for future use (issue #92475: `ALTER MODIFY DEFAULT` freezes old expression into parts). This PR only writes `type_default` markers; reading an `expression` marker throws `CORRUPTED_DATA` to fail closed until Phase 2 implements expression evaluation.

Related: #4968, #92475
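The read-side rule described above can be summarised as a three-way decision. The sketch below is a simplified model for a single `UInt64` column; `PartView` and `readUInt64` are hypothetical names, and the real `fillMissingColumns` works on whole blocks rather than single values.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <string>

// Hypothetical view of one data part.
struct PartView
{
    std::set<std::string> stored_columns;  // columns with data on disk
    std::set<std::string> missing_columns; // names recorded in the missing_columns marker
};

// Value a reader would produce in this model. `stored_value` stands in for data
// read from disk; `current_default` is the value of the table's current DEFAULT
// expression, if the column has one.
std::uint64_t readUInt64(const PartView & part, const std::string & column,
                         std::uint64_t stored_value, std::optional<std::uint64_t> current_default)
{
    if (part.stored_columns.count(column) != 0)
        return stored_value; // physically present: read as written
    if (part.missing_columns.count(column) != 0)
        return 0;            // skipped at INSERT: frozen type-default
    return current_default.value_or(0); // never written: evaluate the current default
}
```

The middle branch is what distinguishes a column skipped at INSERT from one added later by `ALTER TABLE ADD COLUMN`: only the latter follows the current `DEFAULT` expression.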
On-disk format (`serialization.json`):

`{ "missing_columns": [ { "name": "b", "default": "type_default" } ], "version": 2 }`

Changes:

- `skip_empty_columns_on_insert` (default `false`).
- `IColumn::hasOnlyTypeDefaults` with optimized overrides for `ColumnVector`/`ColumnDecimal` (`memoryIsZero`), `ColumnString`/`ColumnArray` (zero offsets), `ColumnNullable` (all-1 null map), `ColumnSparse` (no stored non-default values), and `ColumnTuple` (delegates to sub-columns).
- `MergeTreeDataWriter::writeTempPartImpl` (`skipEmptyColumnsOnInsert`). Columns with a `DEFAULT`/`MATERIALIZED`/`ALIAS` expression are never skipped. A column is skipped only when `IDataType::getDefault()` coincides with the column's zero representation (`isDefaultAt(0)` on a column filled via `getDefault()`), which correctly excludes types like `Date32` (type-default 1900-01-01 ≠ memory-zero) and `Enum` (first declared value ≠ 0). Patch parts are excluded, and at least one physical column is always kept.
- `serialization_info_version >= with_missing_columns`. The populating step is authoritative about the part format version, so `SerializationInfoByName::getVersion` never silently upgrades a part past the configured value (which would make older servers reject it during a rolling upgrade).
- `serialization.json` via `SerializationInfoByName` (new `WITH_MISSING_COLUMNS` version). Only `type_default` markers are written; `expression` markers are rejected on read until Phase 2. The list is written in sorted order so that identical parts produce identical checksums on different replicas.
- … `MergeTask`, for columns that end up absent from the merged part), through mutations (`MutateTask::getColumnsForNewDataPart`, including renames of missing columns), through compact-part renames (`splitAndModifyMutationCommands`), and through on-the-fly `ALTER RENAME COLUMN` on read (`IMergeTreeReader::fillMissingColumns` translates the requested name back through `alter_conversions`).
- `Nullable`; key columns; `DEFAULT` expression not skipped; `Array`; `Tuple`; `ColumnSparse` source; compact parts; `LowCardinality`; stable values across `ALTER ... DEFAULT`; a non-zero-default `Enum`; marker across mutation/merge/rename after `ALTER DEFAULT`; version gate; DETACH/ATTACH TABLE; DETACH/ATTACH PARTITION; BACKUP/RESTORE; FREEZE; CHECK TABLE; MATERIALIZE COLUMN; CLEAR COLUMN; lightweight DELETE; chained mutations; INSERT SELECT; REPLACE PARTITION; ATTACH PARTITION FROM; ALTER ADD/DROP COLUMN; projections; FixedString; Map; mixed parts merge; Date/DateTime; pre-ADD-COLUMN + skip merge; type-changing mutation; compact-part rename+mutation; Date32 regression.
- … `test_skip_empty_columns`) with 6 cases: replicated consistency; merge marker propagation across replicas; mixed-version version gate; backward-compat fallback for old parts; restart durability; REPLACE PARTITION across replicated tables.

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

MergeTree INSERT can now skip writing columns whose values are entirely type-defaults (zeros, empty strings, `NULL`s), saving disk space for sparse-update workloads. Enabled by the MergeTree setting `skip_empty_columns_on_insert` together with `serialization_info_version = 'with_missing_columns'`. Missing columns carry frozen defaults in `serialization.json`, so a later `ALTER MODIFY COLUMN ... DEFAULT` does not retroactively change the inserted values.

Documentation entry for user-facing changes