wombatu-kun opened a new pull request, #16508: URL: https://github.com/apache/iceberg/pull/16508
Closes #16458 ## Summary Five vectorized Parquet decoders in `iceberg-arrow` allocated heap arrays whose size came directly from the file bytes — varint headers in DELTA_BINARY_PACKED, the available-byte count in BYTE_STREAM_SPLIT, packed-group counts in the RLE/PACKED reader, and prefix-length entries in DELTA_BYTE_ARRAY. A crafted page could force a multi-hundred-MB allocation before any row-batch or page-size limit applied. Maintainers downgraded the report from security to a normal hardening task and invited a fix. This PR adds tight bound checks before each allocation, so a malformed page now fails with a descriptive `IllegalArgumentException` instead of allocating gigabytes of heap. Bounds come from each encoding's own math, not from arbitrary tuning knobs, so spec-conformant pages are unaffected. ## What changed `BaseVectorizedParquetValuesReader`: PACKED group count fits in `int` (avoids `numGroups * 8` overflow) and the run cannot claim more bytes than the page still has. `VectorizedDeltaEncodedValuesReader`: `blockSizeInValues` is in `(0, 1 << 20]` — the upper bound is ~8000× the parquet-mr default of 128, well clear of any legitimate writer. `miniBlocksPerBlock` is in `(0, blockSize]`. `totalValueCount` is in `[0, valueCount]` (it is the file-declared non-null count, which can never exceed the caller's total). `VectorizedByteStreamSplitValuesReader`: page size equals `valueCount * elementSizeInBytes` exactly (the encoding is byte-exact). The previously-ignored `valueCount` parameter is now used for validation. `readBinary(int len)` enforces `0 <= len <= decodedDataStream.remaining()` before allocating. `VectorizedDeltaByteArrayValuesReader`: prefix length fits in the previous value (spec invariant), total length is at least the prefix length, and the combined `prefix + suffix` length does not overflow `int`. `VectorizedDeltaLengthByteArrayValuesReader`: `readBinary(int len)` enforces `0 <= len <= dataStream.available()` before allocating. ## Tests One new test class per modified reader (`Test*ValuesReader`), plus an abstract `VectorizedParquetDecoderTestBase` that owns the shared forged-page helpers (`page`, `concat`). Each test class covers every new guard via a hand-built malformed page, plus a positive round-trip to confirm spec-conformant pages still decode correctly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
