mbutrovich opened a new pull request, #1779: URL: https://github.com/apache/iceberg-rust/pull/1779
## Rationale for this change

Iceberg's file splitting feature allows large Parquet files to be divided across multiple tasks for parallel processing. When Iceberg Java splits a file using `splitOffsets()`, it returns byte positions corresponding to row group boundaries, and each `FileScanTask` carries `start` and `length` fields specifying which byte range to read. However, iceberg-rust ignores these fields when pruning row groups. This manifested as a test failure in [Comet](https://github.com/apache/datafusion-comet/pull/2528), where the Iceberg Java test `TestRewriteDataFilesAction` returned duplicate rows.

**Root cause**: The `process_file_scan_task` function in iceberg-rust does not filter row groups by the `start` and `length` byte range, even though these fields are populated on each `FileScanTask`.

## What changes are included in this PR?

1. **New method `filter_row_groups_by_byte_range`** (lines 733-776):
   - Calculates the byte position of each row group in the Parquet file
   - Determines which row groups overlap the specified `[start, start+length)` byte range
   - Returns a vector of row group indices to read
   - Accounts for the 4-byte Parquet magic header ("PAR1")

2. **Integrated byte range filtering into `process_file_scan_task`** (lines 245-291):
   - Calls `filter_row_groups_by_byte_range` only when `start != 0 || length != 0`, preserving backwards compatibility for tasks without byte ranges
   - Merges byte range filtering with predicate-based filtering by computing the intersection of the two selections
   - Updated comments to reflect the three sources of row group filtering (byte ranges, equality deletes, and predicates)

3. **Technical details**:
   - Byte range filtering happens before predicate filtering to reduce the search space
   - When both byte range and predicate filters exist, only row groups passing both filters are read
   - Overlap is decided with the standard half-open range overlap test: `rg_start < end && start < rg_end`
   - A minimal sketch of this selection logic is included at the end of this description

## Are these changes tested?

1. New test `test_file_splits_respect_byte_ranges` (lines 1325-1523):
   - Creates a Parquet file with 3 row groups (100 rows each, 300 total)
   - Creates two `FileScanTask` instances with different byte ranges:
     - Task 1: bytes covering only the first row group → expects 100 rows
     - Task 2: bytes covering the second and third row groups → expects 200 rows
   - Verifies both the row counts and the actual data values
   - **Previously failed** (both tasks returned all 300 rows); **now passes** (Task 1 returns 100 rows, Task 2 returns 200 rows)

2. The Iceberg Java `TestRewriteDataFilesAction` tests now pass with Comet.
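To make the selection logic concrete, here is a minimal sketch of the approach described above, written against the `parquet` crate's `ParquetMetaData`. The function names, signatures, and the use of `compressed_size()` to accumulate row group offsets from the 4-byte magic header are illustrative assumptions, not code copied from this PR.

```rust
use parquet::file::metadata::ParquetMetaData;

/// Illustrative sketch (not the PR's exact code): select the row groups whose
/// on-disk byte ranges overlap the task's `[start, start + length)` range.
fn select_row_groups_by_byte_range(
    metadata: &ParquetMetaData,
    start: u64,
    length: u64,
) -> Vec<usize> {
    let end = start + length;
    // Row group data begins immediately after the 4-byte "PAR1" magic header.
    let mut rg_start = 4u64;
    let mut selected = Vec::new();

    for (idx, rg) in metadata.row_groups().iter().enumerate() {
        // Assumption: each row group occupies its total compressed size on disk.
        let rg_end = rg_start + rg.compressed_size() as u64;
        // Half-open overlap test: [rg_start, rg_end) intersects [start, end).
        if rg_start < end && start < rg_end {
            selected.push(idx);
        }
        rg_start = rg_end;
    }
    selected
}

/// When a predicate has also selected row groups, only indices passing both
/// filters are read: the intersection of the two selections.
fn intersect_selections(byte_range_rgs: &[usize], predicate_rgs: &[usize]) -> Vec<usize> {
    predicate_rgs
        .iter()
        .copied()
        .filter(|idx| byte_range_rgs.contains(idx))
        .collect()
}
```

Because split boundaries produced by `splitOffsets()` fall on row group boundaries, the half-open overlap test assigns each row group to exactly one task, which is what eliminates the duplicate rows seen in the Comet failure.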
