findepi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1816204620
##########
format/puffin-spec.md:
##########
@@ -123,6 +123,49 @@ The blob metadata for this blob may include following properties:
 - `ndv`: estimate of number of distinct values, derived from the sketch.
+#### `delete-vector-v1` blob type
+
+A serialized delete vector (bitmap) that represents the positions of rows in a
+file that are deleted. A set bit at position P indicates that the row at
+position P is deleted.
+
+The vector supports positive 64-bit positions, but is optimized for cases where
+most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps.
+64-bit positions are divided into a 32-bit "key" using the most significant 4
+bytes and a 32-bit sub-position using the least significant 4 bytes. For each
+key in the set of positions, a 32-bit Roaring bitmap is maintained to store a
+set of 32-bit sub-positions for that key.
+
+To test whether a certain position is set, its most significant 4 bytes (the
+key) are used to find a 32-bit bitmap and the least significant 4 bytes (the
+sub-position) are tested for inclusion in the bitmap. If a bitmap is not found
+for the key, then it is not set.
+
+The serialized blob contains:
+* The length of the vector and magic bytes stored as 4 bytes, little-endian

Review Comment:
This is an interesting perspective. If we can indeed generate Delta metadata on top of these newly-written Puffin files and feed that to old Delta readers without any modifications to those readers, there's certainly some value in it.

@rdblue this discussion is interesting and the compatibility capabilities are _potentially_ useful. Why doesn't the word "Delta" appear in the spec, and why is there only one ([unclear](https://github.com/apache/iceberg/pull/11238/files#r1796159961)) occurrence of "compatibility"? Wouldn't it be useful for better understanding of a) why it's the way it is and b) what a user can do thanks to this? Without this the spec is complete (context isn't needed for completeness), but confusing.
There are some redundant fields (eg length) and a spec implementor may spend some time wondering eg whether length needs to be validated or ignored.
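
For reference, the key/sub-position lookup described in the quoted spec text can be sketched roughly as below. This is only an illustrative sketch under stated assumptions: `DeleteVectorSketch` and `split_position` are hypothetical names, plain Python sets stand in for the 32-bit Roaring bitmaps, position 0 is assumed to be valid, and the serialized layout (length, magic bytes) is not modeled at all.

```python
# Illustrative sketch only; not the Iceberg or Delta implementation.
# Assumption: plain sets stand in for 32-bit Roaring bitmaps.

MAX_POSITION = (1 << 63) - 1  # assumed upper bound for "positive 64-bit positions"


def split_position(pos):
    """Split a 64-bit position into (key, sub_position).

    key is the most significant 4 bytes, sub_position the least significant 4.
    """
    if not 0 <= pos <= MAX_POSITION:
        raise ValueError(f"position out of range: {pos}")
    return pos >> 32, pos & 0xFFFFFFFF


class DeleteVectorSketch:
    """Hypothetical in-memory model: one 32-bit bitmap (here, a set) per key."""

    def __init__(self):
        self._bitmaps = {}  # key -> set of 32-bit sub-positions

    def delete(self, pos):
        key, sub = split_position(pos)
        self._bitmaps.setdefault(key, set()).add(sub)

    def is_deleted(self, pos):
        key, sub = split_position(pos)
        bitmap = self._bitmaps.get(key)
        # If no bitmap exists for the key, the position is not set.
        return bitmap is not None and sub in bitmap


dv = DeleteVectorSketch()
dv.delete(5)
dv.delete((1 << 32) + 7)
print(dv.is_deleted((1 << 32) + 7))  # key 1, sub-position 7
```

The sketch only covers the in-memory lookup; it does not help with the question of whether the serialized length field should be validated or ignored.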