25.8 Antalya ports: Read optimization using Iceberg metadata#1069
Conversation
| /// Delta lake related object metadata. | ||
| std::optional<DataLakeObjectMetadata> data_lake_metadata; | ||
| /// Information about columns | ||
| std::optional<DataFileMetaInfoPtr> file_meta_info; |
There was a problem hiding this comment.
why optional shared_ptr? What would set null-value mean in this case and how is it different from unset value?
| std::map<size_t, ConstColumnWithValue> constant_columns_with_values; | ||
| std::unordered_set<String> constant_columns; | ||
|
|
||
| NamesAndTypesList requested_columns_copy = read_from_format_info.requested_columns; |
There was a problem hiding this comment.
Can we avoid making a copy here, and use original columns for building requested_columns_list ? I'm Ok with moving that copying down, right before we actually modify it
| requested_columns_list[column.getNameInStorage()] = std::make_pair(column_index++, column); | ||
| } | ||
|
|
||
| if (context_->getSettingsRef()[Setting::allow_experimental_iceberg_read_optimization]) |
There was a problem hiding this comment.
It would be nice to have a general description of this optimization, as a comment to the whole section.
| if (context_->getSettingsRef()[Setting::allow_experimental_iceberg_read_optimization]) | ||
| { | ||
| auto file_meta_data = object_info->getFileMetaInfo(); | ||
| if (file_meta_data.has_value()) |
There was a problem hiding this comment.
again, this is an excellent example: file_meta_data.value() can be null here. I know that there are precautions now, but this is fragile, IMO simple std::shared_ptr would be more robust here.
| if (file_meta_data.has_value()) | |
| if (file_meta_data) |
and
file_meta_data.value()->columns_info
would be replaced by just
file_meta_data->columns_info
Antalya-specific exception: allow_experimental_iceberg_read_optimization
6c8fbaf to
6777073
Compare
There was a problem hiding this comment.
Why do we need this? Why cannot we use Iceberg::ColumnInfo? This is ambiguity with no obvious reason. We have local ColumnInfo (with almost the same subset of fields), and forward-declared another ColumnInfo, and then use them all together. Why?
…wipes_my_snot 25.8 Antalya: Fixes for #1069
…e_count_in_datalake 25.8 Antalya ports: Read optimization using Iceberg metadata
…wipes_my_snot 25.8 Antalya: Fixes for #1069
When an Iceberg manifest carries no per-column statistics, the parsed `DataFileMetaInfo::columns_info` is empty. The read optimization in `StorageObjectStorageSource::createReader` misread this as "every requested column is absent from the file": it replaced each nullable column with a constant `NULL` and set `need_only_count`, so the reader returned correct row counts but all-NULL values — silent data loss. Gate the absent-column-to-NULL loop on a non-empty `columns_info` so that stats-less manifests fall through to the regular reader, which reads present columns normally and resolves schema-evolution-absent columns via `IcebergMetadata::getInitialSchemaByPath`. Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS` and their `*Cluster` variants. Antalya-only, introduced by #1069. Add stateless test `04302_iceberg_read_optimization_no_column_stats` with a checked-in stats-less Iceberg fixture and a `generate.py` that reproduces it. C++ change taken from #1895 Closes #1545 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Read optimization using Iceberg metadata
Port of #1019 with many changes
Documentation entry for user-facing changes
Diff with #1019
Exclude tests: