Summary
Add support for Apache Iceberg v3 deletion vectors in ClickHouse to enable correct and efficient querying of Iceberg v3 tables.
Iceberg v3 introduces binary deletion vectors stored as blobs in Puffin files, replacing position delete files with a compact bitmap representation of deleted row positions for each data file. This significantly reduces metadata overhead and read amplification for tables with frequent row-level updates and deletes.
Use Case
Supporting deletion vectors would allow ClickHouse to seamlessly query Iceberg v3 tables created by other engines while maintaining full compatibility with existing Iceberg read workflows. Spark and Trino perform row-level UPDATE, DELETE, and MERGE operations, generating deletion vectors by default for v3 tables. Without deletion vector support, ClickHouse cannot correctly read such tables, limiting interoperability with modern Iceberg deployments and preventing ClickHouse from serving as a high-performance analytics engine over shared Iceberg data.
Proposed Solution
Implement read/query support for Iceberg v3 deletion vectors as an initial phase. The implementation should:
- Parse Iceberg v3 metadata required to locate deletion vectors.
- Read deletion vector blobs from Puffin files using the manifest-provided offset and length (the `content_offset` and `content_size_in_bytes` fields of the delete file entry).
- Decode the deletion vector bitmap and filter deleted rows during query execution.
- Apply deletion vectors together with existing position and equality deletes.
- Support distributed query execution without requiring write support.
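
To make the read path above concrete, here is a minimal Python sketch of two of the steps: slicing a deletion vector blob out of a larger file by offset and length, and dropping deleted row positions from a batch. It is an illustration only, not ClickHouse code. The function names `read_blob`, `keep_mask`, and `filter_batch` are hypothetical, and the decoded bitmap is modeled as a plain Python set of row positions; decoding the actual bitmap serialization is deliberately left out.

```python
import io
from typing import BinaryIO, List, Sequence, Set


def read_blob(f: BinaryIO, offset: int, length: int) -> bytes:
    """Read exactly `length` bytes starting at `offset`.

    Assumption: `offset` and `length` come from the manifest entry, so the
    blob can be fetched with one ranged read instead of scanning the file.
    """
    f.seek(offset)
    data = f.read(length)
    if len(data) != length:
        raise EOFError(f"expected {length} bytes at offset {offset}, got {len(data)}")
    return data


def keep_mask(first_row_pos: int, num_rows: int, deleted: Set[int]) -> List[bool]:
    """Build a per-row mask for a batch: True means the row survives.

    `first_row_pos` is the position of the batch's first row within its data
    file; `deleted` stands in for the decoded deletion vector (hypothetical).
    """
    return [(first_row_pos + i) not in deleted for i in range(num_rows)]


def filter_batch(rows: Sequence, first_row_pos: int, deleted: Set[int]) -> list:
    """Return only the rows whose file-level position is not deleted."""
    mask = keep_mask(first_row_pos, len(rows), deleted)
    return [row for row, keep in zip(rows, mask) if keep]


if __name__ == "__main__":
    # Stand-in for a Puffin file: the blob occupies bytes 8..13.
    puffin_like = io.BytesIO(b"HEADxxxxDVBLOBtail")
    print(read_blob(puffin_like, offset=8, length=6))

    # A batch of 4 rows starting at file position 10; positions 11 and 13 are deleted.
    print(filter_batch(["a", "b", "c", "d"], first_row_pos=10, deleted={11, 13}))
```

The mask is computed from file-level row positions rather than batch-local indexes, so the same decoded vector can be reused across every batch read from one data file. That property is also what would let each worker in a distributed query filter its own row ranges independently.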
Writing deletion vectors, row lineage, and other Iceberg v3 features can remain out of scope for the initial implementation. This phased approach provides immediate interoperability with Iceberg v3 while minimizing implementation complexity.