moomindani opened a new pull request, #3474: URL: https://github.com/apache/iceberg-python/pull/3474
Part of #2261. Continues #2822.

# Rationale for this change

This adds a `PuffinWriter` for writing Puffin files containing `deletion-vector-v1` blobs, the first building block for deletion-vector write support in PyIceberg (tracking issue #2261).

It revives #2822 by @rambleraptor (with @glesperance's Spark interop test), which was auto-closed by the stale bot rather than on merit. The original work, including all review feedback already addressed there (@ebyhr, @geruh), is preserved commit-for-commit.

On top of that, this PR adds unit tests for two agreed review items that were not yet asserted by any test:

- the blob `fields` value `[2147483645]` (Java `MetadataColumns.ROW_POSITION`, INT_MAX - 2), required for Java/Spark interoperability; and
- the deletion-vector blob framing at the byte level (length prefix, DV magic, CRC-32 over magic + vector), which the `PuffinFile` reader skips, so the round-trip tests did not previously exercise it.

As in the original PR, this is intentionally scoped to the writer + tests so we can agree on the write semantics before wiring it into the delete/manifest writers and the merge-on-read path. Per the original review discussion, the writer expects the caller to provide one merged deletion vector per data file.

## Are these changes tested?

Yes:

- Unit tests for round-trip write/read, the single-blob (1:1) behavior, the DV field id, byte-level blob framing, and empty files (`tests/table/test_puffin.py`).
- A Spark interoperability test confirming PyIceberg can read Spark-written Puffin DVs (`tests/integration/test_puffin_spark_interop.py`, by @glesperance).

## Are there any user-facing changes?

No. `PuffinWriter` is a new internal building block and is not yet wired into any public write path.
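
For reviewers who want the two newly asserted items in concrete form, here is a minimal sketch of the blob framing named in the rationale (length prefix, DV magic, CRC-32 over magic + vector) and of the field-id arithmetic. It is not the code in this PR: `frame_dv_blob` and `parse_dv_blob` are hypothetical helper names, the vector is treated as opaque bytes rather than a serialized roaring bitmap, and the magic value and big-endian byte order are assumptions taken from my reading of the Puffin `deletion-vector-v1` spec.

```python
import struct
import zlib

# Assumption: 4-byte DV magic as listed in the Puffin spec for deletion-vector-v1.
DV_MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])

# Field id asserted for the blob's `fields` value: INT_MAX - 2.
ROW_POSITION_FIELD_ID = (2**31 - 1) - 2


def frame_dv_blob(vector: bytes) -> bytes:
    """Wrap an already-serialized vector in the DV framing (hypothetical helper)."""
    payload = DV_MAGIC + vector
    # Length prefix covers magic + vector; assumed 4 bytes, big-endian.
    length = struct.pack(">I", len(payload))
    # CRC-32 is computed over magic + vector; assumed 4 bytes, big-endian.
    crc = struct.pack(">I", zlib.crc32(payload))
    return length + payload + crc


def parse_dv_blob(blob: bytes) -> bytes:
    """Validate the framing and return the raw vector bytes (hypothetical helper)."""
    if len(blob) < 12:
        raise ValueError("blob too short for length prefix, magic, and CRC")
    (length,) = struct.unpack_from(">I", blob, 0)
    if length != len(blob) - 8:
        raise ValueError("length prefix does not match blob size")
    payload = blob[4 : 4 + length]
    if payload[:4] != DV_MAGIC:
        raise ValueError("bad deletion-vector magic")
    (crc,) = struct.unpack_from(">I", blob, 4 + length)
    if crc != zlib.crc32(payload):
        raise ValueError("CRC-32 mismatch")
    return payload[4:]
```

A byte-level test in this style can check the framing directly, for example by asserting that `frame_dv_blob(v)[4:8] == DV_MAGIC` and that flipping one payload byte makes `parse_dv_blob` raise, which a round trip through a reader that skips the framing would not catch.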
