[PR] fix(reader): `filter_row_groups_by_byte_range` duplicates rows for sub-row-group file splits [iceberg-rust]

via GitHub Wed, 10 Jun 2026 11:04:17 -0700


mbutrovich opened a new pull request, #2615:
URL: https://github.com/apache/iceberg-rust/pull/2615

## Which issue does this PR close?

- Closes #2614.

## What changes are included in this PR?

`ArrowReader::filter_row_groups_by_byte_range` (added in #1779) selected a
row group for every `FileScanTask` byte range that *overlapped* it. When a data
file is split into byte ranges smaller than a row group, multiple tasks overlap
the same row group, so each reads it and the rows are duplicated. This surfaced
in Apache DataFusion Comet
([apache/datafusion-comet#4590](https://github.com/apache/datafusion-comet/issues/4590)),
where Spark tiles a file by `split-size` regardless of row-group layout.

This PR switches selection to *midpoint ownership*, matching parquet-java's
[`ParquetMetadataConverter.filterFileMetaDataByMidpoint`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1287-L1290):
a split owns a row group only if its `[start, start+length)` range contains
the row group's midpoint. Because the splits tile the file contiguously and
disjointly, exactly one split contains a given midpoint, so each row group is
read once. The comparison is half-open (`start <= midpoint < end`) so a
midpoint landing on a split boundary belongs to the upper split.
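
To make the difference between the two rules concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `RowGroupExtent`, `overlaps`, `owns_by_midpoint`, and `selections_per_row_group` are hypothetical names, and the midpoint is assumed to be `start + compressed_size / 2`, which is one plausible reading of the parquet-java rule linked above.

```rust
/// Hypothetical stand-in for a row group's byte extent within a data file.
/// The real reader derives these values from the Parquet footer metadata.
#[derive(Debug, Clone, Copy)]
struct RowGroupExtent {
    start: u64,
    compressed_size: u64,
}

/// Overlap rule (the behaviour described as buggy above): a split selects
/// the row group if the two half-open byte ranges intersect at all.
fn overlaps(rg: RowGroupExtent, split_start: u64, split_len: u64) -> bool {
    let rg_end = rg.start + rg.compressed_size;
    let split_end = split_start + split_len;
    rg.start < split_end && split_start < rg_end
}

/// Midpoint rule: a split selects the row group only if the half-open range
/// [split_start, split_start + split_len) contains the row group's midpoint.
/// Assumption: midpoint = start + compressed_size / 2.
fn owns_by_midpoint(rg: RowGroupExtent, split_start: u64, split_len: u64) -> bool {
    let midpoint = rg.start + rg.compressed_size / 2;
    let split_end = split_start + split_len;
    split_start <= midpoint && midpoint < split_end
}

/// Tiles [0, file_len) into contiguous splits of `split_len` bytes (the last
/// split may be shorter) and counts how many splits select each row group.
/// `split_len` must be non-zero.
fn selections_per_row_group(
    row_groups: &[RowGroupExtent],
    file_len: u64,
    split_len: u64,
    select: fn(RowGroupExtent, u64, u64) -> bool,
) -> Vec<usize> {
    row_groups
        .iter()
        .map(|&rg| {
            (0..file_len)
                .step_by(split_len as usize)
                .filter(|&s| select(rg, s, split_len.min(file_len - s)))
                .count()
        })
        .collect()
}
```

With three 200-byte row groups starting at offsets 4, 204, and 404 in a 700-byte file tiled into 64-byte splits, the overlap rule selects each row group from four different splits, while the midpoint rule selects each from exactly one. The offsets and sizes here are made up for illustration and are not taken from the PR's test file.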

This is a no-op for whole-file tasks (`start=0, length=file_size`), which are
the only tasks iceberg-rust's own planner emits: every midpoint lies in range,
so all row groups are selected. It changes only the externally planned
sub-row-group split case, completing the byte-range work from #1779.

## Are these changes tested?

Yes. A new test `test_sub_row_group_splits_do_not_duplicate_rows` writes a
3-row-group file, tiles it into 64-byte splits, reads every split through
`ArrowReader`, and asserts each row appears exactly once. It returns ~2800 rows
before the fix and exactly 300 after. The existing
`test_file_splits_respect_byte_ranges` (boundary-aligned splits) continues to
pass.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
