mbutrovich opened a new pull request, #1779:
URL: https://github.com/apache/iceberg-rust/pull/1779

   ## Rationale for this change
   
   Iceberg's file splitting feature allows large Parquet files to be divided 
across multiple tasks for parallel processing. When Iceberg Java splits a file 
using `splitOffsets()`, it returns byte positions corresponding to row group 
boundaries, and each `FileScanTask` contains `start` and `length` fields 
specifying which byte range to read.
   
   However, iceberg-rust ignores these fields for row group pruning. This 
manifested as a test failure in 
[Comet](https://github.com/apache/datafusion-comet/pull/2528) where the Iceberg 
Java test `TestRewriteDataFilesAction` returned duplicate rows.
   
   **Root cause**: The `process_file_scan_task` function in iceberg-rust does 
not filter row groups based on the `start` and `length` byte ranges, even 
though these fields are populated on each `FileScanTask`. As a result, every 
task reads the entire file, and split files produce duplicate rows.
   
   ## What changes are included in this PR?
   
   1. **New method `filter_row_groups_by_byte_range`** (lines 733-776):
      - Calculates the byte position of each row group in the Parquet file
      - Determines which row groups overlap with the specified `[start, 
start+length)` byte range
      - Returns a vector of row group indices to read
      - Accounts for the 4-byte Parquet magic header ("PAR1")
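
   The logic described above can be sketched roughly as follows. This is a simplified standalone illustration, not the actual iceberg-rust code: it assumes row group byte positions can be derived by accumulating each row group's total compressed size, starting just past the 4-byte "PAR1" magic header, and uses the half-open overlap test from the technical details below.

   ```rust
   /// Sketch: select the indices of row groups whose byte ranges overlap
   /// the split's [start, start + length) range.
   ///
   /// `row_group_sizes` holds the total compressed size of each row group,
   /// in file order (an assumption for this illustration).
   fn filter_row_groups_by_byte_range(
       row_group_sizes: &[u64],
       start: u64,
       length: u64,
   ) -> Vec<usize> {
       let end = start + length;
       let mut selected = Vec::new();
       // Row group data begins after the 4-byte Parquet magic header "PAR1".
       let mut rg_start: u64 = 4;
       for (idx, &size) in row_group_sizes.iter().enumerate() {
           let rg_end = rg_start + size;
           // Standard half-open range overlap test: rg_start < end && start < rg_end
           if rg_start < end && start < rg_end {
               selected.push(idx);
           }
           rg_start = rg_end;
       }
       selected
   }

   fn main() {
       // Three row groups of 1000 bytes each occupy [4, 1004), [1004, 2004), [2004, 3004).
       let sizes = [1000, 1000, 1000];
       // A split covering bytes [0, 1004) selects only the first row group.
       assert_eq!(filter_row_groups_by_byte_range(&sizes, 0, 1004), vec![0]);
       // A split covering bytes [1004, 3004) selects the second and third.
       assert_eq!(filter_row_groups_by_byte_range(&sizes, 1004, 2000), vec![1, 2]);
       println!("ok");
   }
   ```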
   
   2. **Integrated byte range filtering into `process_file_scan_task`** (lines 
245-291):
      - Calls `filter_row_groups_by_byte_range` when `start != 0 || length != 
0` to maintain backwards compatibility
      - Properly merges byte range filtering with predicate-based filtering by 
computing the intersection
      - Updated comments to reflect three sources of row group filtering (byte 
ranges, equality deletes, and predicates)
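
   The intersection step can be sketched like this. The function name and `Option`-based signature are hypothetical, used only to illustrate merging two optional row-group selections (where `None` means "no filter, read everything"):

   ```rust
   use std::collections::HashSet;

   /// Sketch: intersect an optional byte-range selection with an optional
   /// predicate-based selection. Only row groups passing both filters survive;
   /// if only one filter exists, its selection is used as-is.
   fn merge_row_group_selections(
       byte_range: Option<Vec<usize>>,
       predicate: Option<Vec<usize>>,
   ) -> Option<Vec<usize>> {
       match (byte_range, predicate) {
           (Some(a), Some(b)) => {
               let b_set: HashSet<usize> = b.into_iter().collect();
               Some(a.into_iter().filter(|i| b_set.contains(i)).collect())
           }
           (Some(a), None) => Some(a),
           (None, Some(b)) => Some(b),
           // No filters at all: read every row group.
           (None, None) => None,
       }
   }

   fn main() {
       // Byte range keeps row groups 0 and 1; the predicate keeps 1 and 2;
       // only row group 1 passes both.
       let merged = merge_row_group_selections(Some(vec![0, 1]), Some(vec![1, 2]));
       assert_eq!(merged, Some(vec![1]));
       println!("ok");
   }
   ```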
   
   3. **Technical details**:
      - Byte range filtering happens before predicate filtering to reduce the 
search space
      - When both byte range and predicate filters exist, only row groups 
passing both filters are read
      - The implementation correctly handles overlapping byte ranges using the 
standard range overlap formula: `rg_start < end && start < rg_end`
   
   ## Are these changes tested?
   
   1. New test `test_file_splits_respect_byte_ranges` (lines 1325-1523):
      - Creates a Parquet file with 3 row groups (100 rows each, 300 total)
      - Creates two `FileScanTask` instances with different byte ranges:
        - Task 1: bytes covering only the first row group → expects 100 rows
        - Task 2: bytes covering second and third row groups → expects 200 rows
      - Verifies both row counts and actual data values are correct
      - **Previously failed** (both tasks returned all 300 rows)
      - **Now passes** (Task 1 returns 100 rows, Task 2 returns 200 rows)
   
   2. The Iceberg Java test `TestRewriteDataFilesAction` now passes with Comet
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]