aokolnychyi commented on code in PR #11238:
URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798622067
##########
format/puffin-spec.md:
##########

@@ -123,6 +123,54 @@
 The blob metadata for this blob may include following properties:
 
 - `ndv`: estimate of number of distinct values, derived from the sketch.
 
+#### `delete-vector-v1` blob type
+
+A serialized delete vector (bitmap) that represents the positions of rows in a
+file that are deleted. A set bit at position P indicates that the row at
+position P is deleted.
+
+The vector supports positive 64-bit positions (the most significant bit must be
+0), but is optimized for cases where most positions fit in 32 bits by using a
+collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a
+32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using
+the least significant 4 bytes. For each key in the set of positions, a 32-bit
+Roaring bitmap is maintained to store the set of 32-bit sub-positions for that
+key.
+
+To test whether a certain position is set, its most significant 4 bytes (the
+key) are used to find a 32-bit bitmap and the least significant 4 bytes (the
+sub-position) are tested for inclusion in that bitmap. If no bitmap is found
+for the key, the position is not set.
+
+The serialized blob contains:
+* The length of the vector and magic bytes stored as 4 bytes, big-endian
+* A 4-byte magic sequence, `D1 D3 39 64`

Review Comment:
   While I don’t think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want as much helpful information as possible. Having the magic number allows us to cross-check the serialization format without reading the footer and may help debug issues with offsets. It will also be useful if we add more serialization formats in the future. I agree it is unlikely we will be able to successfully deserialize the rest of the content if the offset is invalid, but still. If we end up in that situation, it would mean there was an ugly bug, and having more metadata would only help.
   Overall, it does seem like a reasonable sanity check to me, similar to the magic numbers in zstd and gzip. We once had to debug issues with bit flips while reading manifests. There was no easy way to prove that we hadn't corrupted the files ourselves and that a faulty disk was to blame. A CRC check would have caught those bit flips and saved a ton of time. I’d propose keeping the magic number and the CRC independently of whether we decide to follow the Delta Lake blob layout. The length field and byte order are more debatable: there is no merit in them beyond compatibility with Delta Lake.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
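The key/sub-position scheme quoted in the diff can be sketched in Python, with plain `set` objects standing in for the 32-bit Roaring bitmaps the spec actually requires. The `DeleteVector` class and its method names are illustrative only, not from the Iceberg codebase:

```python
class DeleteVector:
    """Sketch of a 64-bit position set backed by per-key 32-bit sets.

    Plain Python sets stand in for the 32-bit Roaring bitmaps the spec
    requires; the key/sub-position split is the same.
    """

    def __init__(self):
        self.bitmaps = {}  # key (high 32 bits) -> set of 32-bit sub-positions

    def add(self, pos):
        if pos < 0 or pos > 0x7FFF_FFFF_FFFF_FFFF:
            raise ValueError("positions must be positive 64-bit integers")
        key = pos >> 32          # most significant 4 bytes
        sub = pos & 0xFFFF_FFFF  # least significant 4 bytes
        self.bitmaps.setdefault(key, set()).add(sub)

    def contains(self, pos):
        bitmap = self.bitmaps.get(pos >> 32)
        # no bitmap for the key means the position is not set
        return bitmap is not None and (pos & 0xFFFF_FFFF) in bitmap


dv = DeleteVector()
dv.add(7)                  # small position: lands in the key-0 bitmap
dv.add((5 << 32) | 123)    # key 5, sub-position 123
print(dv.contains(7))                # True
print(dv.contains((5 << 32) | 123))  # True
print(dv.contains(123))              # False: key-0 bitmap lacks 123
```

Note how a small position and a large position with the same low 32 bits end up in different per-key bitmaps, which is why a membership test must consult both halves.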
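The kind of reader-side sanity check argued for in the comment might look like the following sketch, assuming only the two header fields quoted in the diff (a 4-byte big-endian length followed by the magic sequence `D1 D3 39 64`). `read_header` is a hypothetical helper, not Iceberg API:

```python
import struct

MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])


def read_header(buf):
    """Parse the 4-byte big-endian length and verify the magic sequence.

    Failing fast with context here, instead of later with a confusing
    deserialization error, is the debugging benefit discussed above.
    """
    if len(buf) < 8:
        raise ValueError("blob too short for length + magic")
    (length,) = struct.unpack_from(">i", buf, 0)
    if buf[4:8] != MAGIC:
        raise ValueError(f"bad magic {buf[4:8].hex()}; offset likely wrong")
    return length


# a well-formed header: length 42, then the magic bytes
header = struct.pack(">i", 42) + MAGIC
print(read_header(header))  # 42
```

If the blob offset recorded in the Puffin footer were wrong, the magic comparison would fail immediately with a message naming the bytes actually found, which is far easier to act on than a garbled bitmap.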