zeroshade commented on code in PR #1127:
URL: https://github.com/apache/iceberg-go/pull/1127#discussion_r3523596775


##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile) 
(*RoaringPositionBitmap, error
                return nil, fmt.Errorf("read DV blob at offset %d: %w", offset, 
err)
        }
 
-       cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+       blobCardinality, hasBlobCardinality, err := 
blobCardinality(reader.Blobs(), offset, size)
        if err != nil {
                return nil, fmt.Errorf("DV file %s: %w", dvFile.FilePath(), err)
        }
-       if !ok {
+       manifestCardinality := dvFile.Count()
+       hasManifestCardinality := manifestCardinality > 0
+
+       var expectedCardinality int64
+       switch {
+       case hasManifestCardinality && hasBlobCardinality:
+               if manifestCardinality != blobCardinality {
+                       return nil, fmt.Errorf("manifest record_count %d 
disagrees with puffin cardinality property %d",

Review Comment:
   Two polish items while updating this block: line 176 shadows the 
package-level `blobCardinality` helper, so renaming the local to something like 
`puffinCardinality` would avoid confusion; and this mismatch error should 
include the DV file path, consistent with the neighboring `ReadDV` errors, to 
make corrupt-file diagnostics actionable.



##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile) 
(*RoaringPositionBitmap, error
                return nil, fmt.Errorf("read DV blob at offset %d: %w", offset, 
err)
        }
 
-       cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+       blobCardinality, hasBlobCardinality, err := 
blobCardinality(reader.Blobs(), offset, size)
        if err != nil {
                return nil, fmt.Errorf("DV file %s: %w", dvFile.FilePath(), err)
        }
-       if !ok {
+       manifestCardinality := dvFile.Count()
+       hasManifestCardinality := manifestCardinality > 0

Review Comment:
   This treats `record_count == 0` as if the manifest cardinality were absent. 
For an `iceberg.DataFile`, manifest `record_count` is a required non-nullable 
long, so zero is a valid expected cardinality and still needs to participate in 
both checks. As written, manifest 0 + Puffin cardinality 5 + decoded bitmap 
cardinality 5 falls into the blob-only path and is accepted instead of 
rejected. Please treat the manifest cardinality as present for data files, 
compare it to the Puffin cardinality whenever the property exists, and validate 
the decoded bitmap against it, including zero. A regression test for manifest 0 
/ Puffin 5 / bitmap 5 should error.



##########
table/dv/deletion_vector_test.go:
##########
@@ -413,18 +413,29 @@ func TestReadDVCardinalityValidation(t *testing.T) {
                assert.Equal(t, int64(5), bm.Cardinality())
        })
 
-       t.Run("mismatched cardinality is rejected", func(t *testing.T) {
+       t.Run("manifest and puffin cardinality disagreement is rejected", 
func(t *testing.T) {
                dir := t.TempDir()
                path, meta := writePuffinWithDVBlobAndProps(t, dir, 
dvBlobBytes, map[string]string{
                        "referenced-data-file": 
"s3://bucket/data/data-001.parquet",
-                       // Bitmap actually has 5 positions; claim 99.
+                       // Manifest record_count below says 5; claim 99 in the 
Puffin property.
                        "cardinality": "99",
                })
                offset, size := meta.Offset, meta.Length
                _, err := ReadDV(iceio.LocalFS{}, newDVTestFile(path, 5, 
&offset, &size))
                require.Error(t, err)
-               // Pin the specific error path: DeserializeDV's "cardinality
-               // mismatch", not the helper's "invalid cardinality" parse 
error.
+               assert.Contains(t, err.Error(), "manifest record_count 5 
disagrees with puffin cardinality property 99")
+       })
+
+       t.Run("matching manifest and puffin cardinality still validates 
bitmap", func(t *testing.T) {
+               dir := t.TempDir()
+               path, meta := writePuffinWithDVBlobAndProps(t, dir, 
dvBlobBytes, map[string]string{
+                       "referenced-data-file": 
"s3://bucket/data/data-001.parquet",
+                       // Both metadata sources agree, but the bitmap actually 
has 5 positions.
+                       "cardinality": "99",
+               })
+               offset, size := meta.Offset, meta.Length
+               _, err := ReadDV(iceio.LocalFS{}, newDVTestFile(path, 99, 
&offset, &size))
+               require.Error(t, err)
                assert.Contains(t, err.Error(), "cardinality mismatch")

Review Comment:
   Nearby real line 442 still claims to cover the missing-cardinality-property 
path, but it calls `writePuffinWithDVBlob`, which writes `cardinality: 5`, so 
the test exercises the normal cardinality path instead of missing-property 
tolerance. Please add a fixture/helper that actually omits the cardinality 
property, and include a zero-cardinality case so the manifest-zero behavior is 
covered explicitly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to