emkornfield commented on code in PR #11238:
URL: https://github.com/apache/iceberg/pull/11238#discussion_r1797391875


##########
format/puffin-spec.md:
##########
@@ -123,6 +123,54 @@ The blob metadata for this blob may include following 
properties:
 
 - `ndv`: estimate of number of distinct values, derived from the sketch.
 
+#### `delete-vector-v1` blob type
+
+A serialized delete vector (bitmap) that represents the positions of rows in a
+file that are deleted.  A set bit at position P indicates that the row at
+position P is deleted.
+
+The vector supports positive 64-bit positions (the most significant bit must be
+0), but is optimized for cases where most positions fit in 32 bits by using a
+collection of 32-bit Roaring bitmaps.  64-bit positions are divided into a
+32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using
+the least significant 4 bytes. For each key in the set of positions, a 32-bit
+Roaring bitmap is maintained to store a set of 32-bit sub-positions for that
+key.
+
+To test whether a certain position is set, its most significant 4 bytes (the
+key) are used to find a 32-bit bitmap and the least significant 4 bytes (the
+sub-position) are tested for inclusion in the bitmap. If a bitmap is not found
+for the key, then it is not set.
+
+The serialized blob contains:
+* The length of the vector and magic bytes stored as 4 bytes, big-endian
+* A 4-byte magic sequence, `D1 D3 39 64`

Review Comment:
   > The CRC is computed on the bytes so it can only catch random bit flips.
   
   I'd flip this on its head and I'd like to understand what use-cases we 
expect it to sanity check that would fail?  The CRC doesn't just detect 
bit-flips, it also detects for accidental truncation/enlargement of the range 
(e.g. if you leave of the last byte, it is very unlikely the CDC check would 
pass).  If the argument is the CRC will not be checked on most reads, then the 
consistency of the vectors provides another sanity check (I think it would be 
quite hard to get random sampled bytes that would fully parse, and 
implementations should be careful in parsing it here).
   
   >  I do think checking the magic number is a reasonable sanity check which 
is fairly common in the industry. 
   
   I'm not saying in general it is not useful and even if this case it might 
be, but magic numbers are typically a file level concept so that readers can 
determine if the file-type is something they can in fact read (and disambiguate 
between multiple types of binary formatted data).  In this case this 
information is redundant with the information in the footer.  For direct reads 
based on manifest data, I think a reasonable design option would be to 
introduce a new content type value that reflected 'deletion-vector-v1', which 
would allow readers to fail earlier if the type is not recognized.
   
   As another example, theta sketches in puffin don't include a magic number 
(and as far as I [can tell the library by 
itself](https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/theta/CompactSketch.java#L281)
 doesn't add one for serialization).  
   
   Ultimately it is just 4 bytes, so it might not pay to argue too much about 
it, especially if the community agrees that the compatibility is worth it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to