findepi commented on code in PR #11238:
URL: https://github.com/apache/iceberg/pull/11238#discussion_r1799278242


##########
format/puffin-spec.md:
##########
@@ -123,6 +123,44 @@ The blob metadata for this blob may include following properties:
 
 - `ndv`: estimate of number of distinct values, derived from the sketch.
 
+#### `delete-vector-v1` blob type
+
+A serialized delete vector that represents the positions of rows in a file that
+are deleted.  A set bit at position P indicates that the row at position P is
+deleted.
+
+The bitmap supports positive 64-bit positions, but is optimized for cases where
+most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps.
+64-bit positions are divided into a 32-bit "key" using the most significant 4
+bytes and a 32-bit position using the least significant 4 bytes. For each key
+in the set of positions, a 32-bit Roaring bitmap is maintained to store a set
+of 32-bit positions for that key.
+
+To test whether a certain position is set, its most significant 4 bytes (the
+key) are used to find a 32-bit bitmap and the least significant 4 bytes are
+tested for inclusion in the bitmap. If a bitmap is not found for the key, then
+it is not set.
+
+The serialized blob starts with a 4-byte magic sequence, `D1D33964` (1681511377
+stored as 4 bytes, little-endian). Following the magic bytes is the serialized
+collection of bitmaps. The collection is stored using the Roaring bitmap
+["portable" format][roaring-bitmap-portable-serialization]. This representation
+consists of:
+
+* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian
+* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys:
+    - The key stored as 4 bytes, little-endian
+    - A [32-bit Roaring bitmap][roaring-bitmap-general-layout]
+
+The blob metadata must include the following properties:
+
+* `referenced-data-file`: location of the data file the delete vector applies to

Review Comment:
   So the idea is that the manifest will link to portions of the file (by
path+offset+length)?
   But `referenced-data-file` stays an important property.
   
   I am concerned that if the blob is self-describing but part of that
information isn't used at execution, this leaves room for bugs to creep in. The
self-describing but unused portion may end up containing incorrect information
without anyone noticing, either in files produced by Iceberg itself or in files
from other implementors and query engines that use the Iceberg project for
compatibility testing.
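
   To make the concern concrete, here is a minimal sketch of the kind of
read-time check that would keep the property honest. It assumes the manifest
entry carries the data file path plus the Puffin path, offset, and length; the
`ManifestEntry` class and `check_referenced_data_file` function are
hypothetical names for illustration, not existing Iceberg APIs.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical manifest-side pointer to a delete vector blob."""
    data_file: str    # data file the delete vector is expected to apply to
    puffin_path: str  # Puffin file that holds the blob
    offset: int       # blob offset within the Puffin file
    length: int       # blob length in bytes


def check_referenced_data_file(entry: ManifestEntry,
                               blob_properties: Dict[str, str]) -> None:
    """Raise if the blob's self-description disagrees with the manifest."""
    where = f"{entry.puffin_path}@{entry.offset}+{entry.length}"
    referenced = blob_properties.get("referenced-data-file")
    if referenced is None:
        raise ValueError(f"blob {where} has no referenced-data-file property")
    if referenced != entry.data_file:
        raise ValueError(
            f"blob {where} references {referenced!r}, "
            f"but the manifest expects {entry.data_file!r}"
        )
```

   If readers never run a check along these lines, a writer could emit a wrong
`referenced-data-file` and nothing would catch it.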


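   For reference, the key/position addressing and the magic bytes described in
the hunk above can be sketched as follows. This is only an illustration: plain
Python sets stand in for the 32-bit Roaring bitmaps, the function names are
made up, and the upper bound on positions assumes they are non-negative signed
64-bit values.

```python
import struct
from typing import Dict, Set, Tuple

# Magic sequence quoted in the spec text: 1681511377 as 4 little-endian bytes.
MAGIC = struct.pack("<I", 1681511377)


def split_position(pos: int) -> Tuple[int, int]:
    """Split a 64-bit position into (key, low): upper and lower 4 bytes."""
    if not 0 <= pos < 2**63:  # assumption: positions fit a non-negative long
        raise ValueError(f"position out of range: {pos}")
    return pos >> 32, pos & 0xFFFFFFFF


def is_deleted(bitmaps: Dict[int, Set[int]], pos: int) -> bool:
    """Test a position; a missing key means the position is not set."""
    key, low = split_position(pos)
    bitmap = bitmaps.get(key)
    return bitmap is not None and low in bitmap


# Usage: one bitmap under key 3 containing the 32-bit position 7.
bitmaps = {3: {7}}
print(MAGIC.hex().upper())                 # hex form of the magic bytes
print(is_deleted(bitmaps, (3 << 32) | 7))  # position under key 3
print(is_deleted(bitmaps, 7))              # same low bits, key 0 is absent
```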

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

