wombatu-kun opened a new pull request, #16500:
URL: https://github.com/apache/iceberg/pull/16500

   Closes #16475
   
   Hostile or corrupted manifest metadata can claim a negative or near-2 GiB 
`content_offset` / `content_size_in_bytes` on a deletion-vector `DeleteFile`. 
The existing precondition in `BaseDeleteLoader.validateDV` only checks non-null 
and `size <= Integer.MAX_VALUE`, so a negative size slips through to `new 
byte[length]` and throws `NegativeArraySizeException`; a negative offset 
reaches `IOUtil.readFully` and produces an invalid seek. The same unguarded 
allocation pattern lives in `DVUtil.readDV` on the rewrite / DV-merge path used 
by `mergeAndWriteDVsIfRequired`.
   
   ### Changes
   
   - Add `ContentFileUtil.validateDV` as the canonical read-path validator 
(null offset, null size, `offset >= 0`, `size >= 0`, `size <= 
Integer.MAX_VALUE`).
   - Have both `BaseDeleteLoader.validateDV` (data) and `DVUtil.readDV` (core) 
delegate to it. `BaseDeleteLoader` keeps its loader-specific 
`referencedDataFile` check.
   - For defense-in-depth, also reject negative `contentOffset` / 
`contentSizeInBytes` in `FileMetadata.Builder.build()` for `PUFFIN`, mirroring 
the existing `fileSizeInBytes >= 0` / `recordCount >= 0` checks at the same 
site. Producers using the builder can no longer accidentally emit bad metadata; 
the read-side check still defends against corrupted on-disk manifests where 
`BaseFile.internalSet` hydrates fields directly via Avro reflection.
   
   ### Tests
   
   - `TestContentFileUtil` (new) — covers each precondition of the new 
validator (null offset/size, negative offset/size, oversized size, valid zero / 
typical / `Integer.MAX_VALUE` values).
   - `TestFileMetadata` (new) — verifies the builder rejects negative DV 
offset/size and still accepts valid values.
   - `TestBaseDeleteLoader` (new, in package `org.apache.iceberg` so it can 
reach the package-private `Delegates.DelegatingDeleteFile`) — constructs a 
tampered `DeleteFile` that bypasses the builder and asserts 
`BaseDeleteLoader.loadPositionDeletes` throws before any I/O is attempted.
   - Existing `TestRowDelta` (324 tests), `TestDeletionVectorStruct`, 
`TestContentFileParser`, `TestDeleteFiles` regression suites continue to pass.
   
   ### Out of scope
   
   `PuffinReader` / `BlobMetadata` similarly don't range-check blob offsets and 
lengths today, but Puffin is also used for non-DV blobs (statistics) and 
"valid" at that layer is a separate discussion of footer-authoritative vs. 
caller-supplied bounds. Leaving Puffin-level hardening as a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to