Support zstd blob decompression in Puffin by ebyhr · Pull Request #3575 · apache/iceberg-python · GitHub

Support zstd blob decompression in Puffin #3575

Draft
ebyhr wants to merge 1 commit into apache:main from ebyhr:ebi/puffin-compression

ebyhr (Member) commented Jun 28, 2026

Rationale for this change

Prepares PuffinFile for apache-datasketches-theta-v1 blob support by removing the hard constraint that all blobs are deletion-vector-v1 and by adding zstd blob decompression.

Two fixes were required to make this work correctly:

Fix: blob offsets were off by 8

_payload stored puffin[8:], but blob offset values in the Puffin footer are measured from byte 0 of the file. For the existing deletion-vector-v1 case this was harmless — the 8-byte shift accidentally cancelled the 8-byte blob framing ([length:4][magic:4]) that was never stripped. With arbitrary blob types and real offsets the cancellation breaks, so _payload now holds the full file bytes.
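
The offset arithmetic can be sketched on a synthetic buffer. This is an illustration only, not the PuffinFile code: the buffer contains just the 4-byte file magic and two hand-made blobs, the footer is omitted, and `read_blob` is a hypothetical helper.

```python
# Synthetic stand-in for a Puffin file: 4-byte magic, then two blobs back to back.
# The real footer is omitted; offsets and lengths are written out by hand.
MAGIC = b"PFA1"
blob_a = b"AAAAAAAAAAAA"  # 12 bytes
blob_b = b"BBBBBB"        # 6 bytes
file_bytes = MAGIC + blob_a + blob_b

# Offsets as a footer would record them: measured from byte 0 of the file.
blobs = [
    {"offset": 4, "length": 12},
    {"offset": 16, "length": 6},
]


def read_blob(payload: bytes, offset: int, length: int) -> bytes:
    """Slice one blob out of `payload` using a file-absolute offset (hypothetical helper)."""
    return payload[offset : offset + length]


# Slicing the full file bytes yields each blob exactly.
full = [read_blob(file_bytes, b["offset"], b["length"]) for b in blobs]

# Slicing a payload that already dropped the first 8 bytes lands 8 bytes too late.
shifted = [read_blob(file_bytes[8:], b["offset"], b["length"]) for b in blobs]
```

The shifted slice is equivalent to `file_bytes[offset + 8 : offset + 8 + length]`, which is why it only looked right when the first 8 bytes of every blob were framing that the reader wanted to skip anyway.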

Widen PuffinBlobMetadata.type from Literal["deletion-vector-v1"] to str

PuffinFile is a generic Puffin format parser, not a deletion-vector-specific reader. Locking type to a single literal means parsing any Puffin file that contains a non-DV blob (e.g., a theta sketch) raises a Pydantic validation error before we can even read the footer. Widening to str keeps the parser general; blob-type-specific logic belongs in callers (deletion_vectors_from_puffin_file, and the forthcoming theta sketch reader).
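
A minimal sketch of the caller-side filtering this implies, using a stdlib dataclass rather than the actual Pydantic model. `BlobMeta` and `select_blobs` are hypothetical names for illustration, and the offsets and lengths are made up.

```python
from dataclasses import dataclass

DELETION_VECTOR = "deletion-vector-v1"
THETA_SKETCH = "apache-datasketches-theta-v1"


@dataclass(frozen=True)
class BlobMeta:
    """Hypothetical stand-in for a blob metadata entry; `type` is a plain str."""

    type: str
    offset: int
    length: int


def select_blobs(blobs: list[BlobMeta], blob_type: str) -> list[BlobMeta]:
    """Caller-side filter: keep only the blobs a given reader understands."""
    return [b for b in blobs if b.type == blob_type]


# A footer listing mixed blob types parses without error because `type` is a str.
footer_blobs = [
    BlobMeta(DELETION_VECTOR, offset=4, length=40),
    BlobMeta(THETA_SKETCH, offset=44, length=32),
]

# A deletion-vector reader picks out only its own blobs and ignores the rest.
dv_blobs = select_blobs(footer_blobs, DELETION_VECTOR)
```

With this split, an unknown blob type is simply skipped by readers that do not handle it instead of failing the whole parse.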

Are these changes tested?

Yes. The tests use Puffin files copied from https://github.com/apache/iceberg/tree/main/core/src/test/resources/org/apache/iceberg/puffin/v1

Are there any user-facing changes?

No. PuffinFile is internal. The type: str widening is backwards-compatible.
