yadavay-amzn opened a new pull request, #16347: URL: https://github.com/apache/iceberg/pull/16347
Fixes #16325.

## Problem

When using GZIP or ZSTD compression, the row group size check in `ParquetWriter` uses `writeStore.getBufferedSize()` which reports compressed bytes after page flushes. Since compressed size is significantly smaller than the configured `targetRowGroupSize`, the threshold is never reached and row groups grow unbounded.

## Fix

Track uncompressed bytes by measuring the `getBufferedSize()` delta before and after each `model.write()` call (before `endRecord()` triggers page flush and compression). Use this accumulated uncompressed size in `checkSize()` instead of the post-compression buffered size. Reset on row group flush.

## Testing

Added `testRowGroupSizeEnforcedWithCompression` in `TestParquet` -- writes 500 records of ~1KB each with GZIP compression and a 64KB row group target. Asserts multiple row groups are created.

- **Without fix**: all 500 records end up in 1 row group (compressed size never hits threshold)
- **With fix**: multiple row groups created respecting the 64KB target
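## Illustrative sketch

The delta-tracking idea described under Fix can be modeled with a small standalone simulation. This is only a sketch, not the code in this PR: `FakeWriteStore`, its 8KB page size, its fixed 20x compression ratio, and `countRowGroups` are all hypothetical stand-ins, and the size check here runs after every record for simplicity. It only shows why a buffered size that shrinks at page flush never reaches the target, while an accumulated pre-flush delta does.

```java
// Simplified model of the row group size check. Nothing here is the real
// Iceberg or parquet-java API; every name and constant is an assumption.
final class RowGroupSizeSketch {

  // Hypothetical write store: raw bytes are buffered, then "compressed"
  // into a page once the pending buffer reaches pageSize.
  static final class FakeWriteStore {
    private final long pageSize;
    private final long compressionRatio;
    private long pendingRaw = 0;        // written, not yet page-flushed
    private long flushedCompressed = 0; // held as compressed pages

    FakeWriteStore(long pageSize, long compressionRatio) {
      this.pageSize = pageSize;
      this.compressionRatio = compressionRatio;
    }

    void write(long rawBytes) {
      pendingRaw += rawBytes;
    }

    // Page flush happens here, so sizes measured afterwards are compressed.
    void endRecord() {
      if (pendingRaw >= pageSize) {
        flushedCompressed += pendingRaw / compressionRatio;
        pendingRaw = 0;
      }
    }

    long getBufferedSize() {
      return pendingRaw + flushedCompressed;
    }

    void reset() {
      pendingRaw = 0;
      flushedCompressed = 0;
    }
  }

  // Returns how many row groups are produced for the given workload.
  // trackUncompressed=false models the old check, true models the fix.
  static int countRowGroups(
      int records, long recordBytes, long targetRowGroupSize, boolean trackUncompressed) {
    FakeWriteStore store = new FakeWriteStore(8 * 1024, 20);
    long uncompressedBytes = 0;
    int recordsInGroup = 0;
    int rowGroups = 0;

    for (int i = 0; i < records; i++) {
      long before = store.getBufferedSize();
      store.write(recordBytes);
      // Measure the delta before endRecord() can flush and compress a page.
      uncompressedBytes += store.getBufferedSize() - before;
      store.endRecord();
      recordsInGroup++;

      long size = trackUncompressed ? uncompressedBytes : store.getBufferedSize();
      if (size >= targetRowGroupSize) {
        rowGroups++;
        store.reset();
        uncompressedBytes = 0; // reset on row group flush
        recordsInGroup = 0;
      }
    }

    if (recordsInGroup > 0) {
      rowGroups++; // final partial row group written on close
    }
    return rowGroups;
  }
}
```

Under these assumed numbers, `countRowGroups(500, 1024, 64 * 1024, false)` returns 1 and `countRowGroups(500, 1024, 64 * 1024, true)` returns 8, which mirrors the without-fix and with-fix behavior described under Testing. The exact counts depend entirely on the made-up page size and compression ratio.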
