mbutrovich opened a new pull request, #2615:
URL: https://github.com/apache/iceberg-rust/pull/2615

   ## Which issue does this PR close?
   
   - Closes #2614.
   
   ## What changes are included in this PR?
   
   `ArrowReader::filter_row_groups_by_byte_range` (added in #1779) selected a 
row group for every `FileScanTask` byte range that *overlapped* it. When a data 
file is split into byte ranges smaller than a row group, multiple tasks overlap 
the same row group, so each reads it and the rows are duplicated. This surfaced 
in Apache DataFusion Comet 
([apache/datafusion-comet#4590](https://github.com/apache/datafusion-comet/issues/4590)),
 where Spark tiles a file by `split-size` regardless of row-group layout.
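
   The duplication can be reproduced with a small model. The sketch below is not the code from `ArrowReader`; `RowGroup`, `select_by_overlap`, and `reads_per_row_group` are hypothetical names, and the offsets (three 100-byte row groups after a 4-byte header, 64-byte splits) are made up for illustration. It counts how many splits would select each row group under an overlap rule.

```rust
/// Hypothetical stand-in for a row group's byte extent within a data file.
#[derive(Debug, Clone, Copy)]
struct RowGroup {
    start: u64,
    len: u64,
}

/// Overlap-based selection (the behavior described above): returns the indices
/// of all row groups whose byte range intersects
/// `[split_start, split_start + split_len)`.
fn select_by_overlap(row_groups: &[RowGroup], split_start: u64, split_len: u64) -> Vec<usize> {
    let split_end = split_start + split_len;
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| rg.start < split_end && split_start < rg.start + rg.len)
        .map(|(i, _)| i)
        .collect()
}

/// Tiles `[0, file_len)` into contiguous splits of `split_size` bytes and
/// counts how many splits select each row group.
fn reads_per_row_group(row_groups: &[RowGroup], file_len: u64, split_size: u64) -> Vec<usize> {
    let mut counts = vec![0usize; row_groups.len()];
    let mut start = 0;
    while start < file_len {
        let len = split_size.min(file_len - start);
        for i in select_by_overlap(row_groups, start, len) {
            counts[i] += 1;
        }
        start += len;
    }
    counts
}

fn main() {
    // Illustrative layout only: a 4-byte header followed by three 100-byte row groups.
    let row_groups = [
        RowGroup { start: 4, len: 100 },
        RowGroup { start: 104, len: 100 },
        RowGroup { start: 204, len: 100 },
    ];
    let counts = reads_per_row_group(&row_groups, 304, 64);
    // Any count above 1 means the row group's rows are emitted more than once.
    println!("{counts:?}");
}
```

   Every count above 1 corresponds to a row group whose rows would be returned by more than one task.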
   
   This PR switches selection to *midpoint ownership*, matching parquet-java's 
[`ParquetMetadataConverter.filterFileMetaDataByMidpoint`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1287-L1290):
 a split owns a row group only if its `[start, start+length)` range contains 
the row group's midpoint. Because the splits tile the file contiguously and 
disjointly, exactly one split contains a given midpoint, so each row group is 
read once. The comparison is half-open (`start <= midpoint < end`) so a 
midpoint landing on a split boundary belongs to the upper split.
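
   As a rough illustration of the rule, here is a minimal sketch, not the code in this PR. `RowGroup`, `owns_row_group`, and `owning_splits` are hypothetical names, the byte offsets are invented, and the midpoint is assumed to be `start + len / 2` with integer division.

```rust
/// Hypothetical stand-in for a row group's byte extent within a data file.
#[derive(Debug, Clone, Copy)]
struct RowGroup {
    start: u64,
    len: u64,
}

/// Midpoint ownership: a split owns the row group only if the row group's
/// midpoint falls in the half-open range `[split_start, split_start + split_len)`.
fn owns_row_group(rg: RowGroup, split_start: u64, split_len: u64) -> bool {
    // Assumption: midpoint computed with integer division.
    let midpoint = rg.start + rg.len / 2;
    let split_end = split_start + split_len;
    split_start <= midpoint && midpoint < split_end
}

/// Tiles `[0, file_len)` into contiguous splits of `split_size` bytes and
/// returns the start offsets of every split that owns `rg`.
fn owning_splits(rg: RowGroup, file_len: u64, split_size: u64) -> Vec<u64> {
    let mut owners = Vec::new();
    let mut start = 0;
    while start < file_len {
        let len = split_size.min(file_len - start);
        if owns_row_group(rg, start, len) {
            owners.push(start);
        }
        start += len;
    }
    owners
}

fn main() {
    // Illustrative layout only: a 4-byte header followed by three 100-byte row groups.
    let row_groups = [
        RowGroup { start: 4, len: 100 },
        RowGroup { start: 104, len: 100 },
        RowGroup { start: 204, len: 100 },
    ];
    for rg in row_groups {
        // With 64-byte splits, each row group should have exactly one owner.
        println!("{rg:?} owned by splits starting at {:?}", owning_splits(rg, 304, 64));
    }
}
```

   In this model a whole-file split (`start = 0`, `length = 304`) owns all three row groups, and a midpoint that lands exactly on a split boundary is claimed by the split that starts there.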
   
   This is a no-op for whole-file tasks (`start=0, length=file_size`), the only 
kind iceberg-rust's own planner emits: every midpoint lies in range, so all 
row groups are selected. The change affects only externally planned splits 
smaller than a row group, completing the byte-range work from #1779.
   
   ## Are these changes tested?
   
   Yes. A new test `test_sub_row_group_splits_do_not_duplicate_rows` writes a 
3-row-group file, tiles it into 64-byte splits, reads every split through 
`ArrowReader`, and asserts each row appears exactly once. The combined read 
returns ~2800 rows before the fix and exactly 300 after. The existing 
`test_file_splits_respect_byte_ranges` (boundary-aligned splits) continues to 
pass.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

