rdblue commented on code in PR #11238:
URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796124558


##########
format/puffin-spec.md:
##########
@@ -123,6 +123,49 @@ The blob metadata for this blob may include following 
properties:
 
 - `ndv`: estimate of number of distinct values, derived from the sketch.
 
+#### `delete-vector-v1` blob type
+
+A serialized delete vector (bitmap) that represents the positions of rows in a
+file that are deleted.  A set bit at position P indicates that the row at
+position P is deleted.
+
+The vector supports positive 64-bit positions, but is optimized for cases where
+most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps.
+64-bit positions are divided into a 32-bit "key" using the most significant 4
+bytes and a 32-bit sub-position using the least significant 4 bytes. For each
+key in the set of positions, a 32-bit Roaring bitmap is maintained to store a
+set of 32-bit sub-positions for that key.
+
+To test whether a certain position is set, its most significant 4 bytes (the
+key) are used to find a 32-bit bitmap and the least significant 4 bytes (the
+sub-position) are tested for inclusion in the bitmap. If a bitmap is not found
+for the key, then it is not set.
+
+The serialized blob contains:
+* The length of the vector and magic bytes stored as 4 bytes, little-endian

Review Comment:
   I talked with the Delta folks and all implementations rely on on these 
values to check the validity of the deletion vector. Removing this would mean 
those checks can't be run and would also break compatibility with all older 
Delta readers. The length is also intended to make the format self-describing. 
It is made up of these blobs concatenated together.
   
   The CRC is definitely useful and I don't think there's a reason to make 
minor changes that break compatibility like swapping the endianness. I'd use 
the same logic for storing the length. While it isn't necessary when embedding 
these in Puffin, it isn't such a problem that we should break compatibility 
with older readers, which would limit options later.
   
   Plus if we keep compatibility, then the deletion vector files in both Delta 
and Iceberg are the same format with the addition of a Puffin footer and magic 
bytes. At least it would be an option to use to Puffin as a container format 
later on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to