jordepic opened a new issue, #2619:
URL: https://github.com/apache/iceberg-rust/issues/2619
### Apache Iceberg Rust version
None
### Describe the bug
filter_row_groups_by_byte_range selects a row group for every scan split
whose byte range overlaps it. When a row group is larger than the split size —
e.g. tables written with a large write.parquet.row-group-size-bytes value (1 GB), where
a file is effectively a single row group spanning all of its splits — every
split selects that row group and reads its full contents. The result is the
same rows returned once per split: an N× over-read (N = number of splits
covering the file). This is a correctness/data-duplication bug: a ~1.6M-row
partition read back ~83M rows, and writing the result back corrupts the table.
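
To make the over-read concrete, here is a minimal, self-contained sketch of overlap-based selection as described above. It is not the iceberg-rust implementation: `RowGroup`, `select_by_overlap`, `plan_splits`, and `total_rows_read` are hypothetical names used only for illustration, and the byte offsets are made up.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a Parquet row group: its byte span in the file
/// and the number of rows it holds.
#[derive(Debug, Clone)]
struct RowGroup {
    bytes: Range<u64>,
    num_rows: u64,
}

/// Overlap-based selection (the behavior reported in this issue): a split
/// picks every row group whose byte span intersects the split's byte range.
fn select_by_overlap(row_groups: &[RowGroup], split: &Range<u64>) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| rg.bytes.start < split.end && split.start < rg.bytes.end)
        .map(|(i, _)| i)
        .collect()
}

/// Cut `[0, file_len)` into consecutive splits of at most `split_size` bytes.
fn plan_splits(file_len: u64, split_size: u64) -> Vec<Range<u64>> {
    assert!(split_size > 0, "split_size must be positive");
    let mut splits = Vec::new();
    let mut start = 0;
    while start < file_len {
        let end = (start + split_size).min(file_len);
        splits.push(start..end);
        start = end;
    }
    splits
}

/// Total rows produced when every split reads every row group it selected.
fn total_rows_read(row_groups: &[RowGroup], splits: &[Range<u64>]) -> u64 {
    splits
        .iter()
        .flat_map(|s| select_by_overlap(row_groups, s))
        .map(|i| row_groups[i].num_rows)
        .sum()
}
```

With one row group of 100 rows spanning bytes 4..1000 and `plan_splits(1000, 250)` producing four splits, `total_rows_read` returns 400: every split overlaps the single row group, so its rows are counted four times.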
### To Reproduce
1. Write a Parquet data file with a single large row group (e.g. set the
row-group size to effectively unbounded so all rows land in one row group).
2. Plan a scan that splits the file into multiple tasks (split size < file
size), so that several splits' byte ranges overlap the one row group. The
splits have to be supplied explicitly through the public FileScanTask API.
3. Read all splits and union the results.
4. Observe that the total row count is a multiple of the file's actual row
count (each split re-reads the whole row group).
(I personally encountered the issue in DataFusion Comet)
### Expected behavior
Each row group is assigned to exactly one split — the one whose byte range
contains the row group's midpoint (start <= midpoint < end) — so total rows
read equal the file's actual row count, regardless of split size vs. row-group
size. This matches parquet-mr / Spark's split-assignment semantics.
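
The midpoint rule can be sketched as follows. This is an illustration under assumptions, not the proposed patch: row groups are modeled as plain byte ranges, and `midpoint` and `select_by_midpoint` are hypothetical helper names.

```rust
use std::ops::Range;

/// Midpoint of a row group's byte span, computed without overflow.
fn midpoint(row_group: &Range<u64>) -> u64 {
    row_group.start + (row_group.end - row_group.start) / 2
}

/// Midpoint-based selection: a split owns a row group only if the row
/// group's midpoint falls inside the split (start <= midpoint < end).
/// Because consecutive splits are disjoint and cover the file, each
/// midpoint lands in exactly one split.
fn select_by_midpoint(row_groups: &[Range<u64>], split: &Range<u64>) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| split.contains(&midpoint(rg)))
        .map(|(i, _)| i)
        .collect()
}
```

For a single row group spanning bytes 4..1000 (midpoint 502) and splits 0..250, 250..500, 500..750, and 750..1000, only the split 500..750 selects it; the other three select nothing, so the row group is read once.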
### Willingness to contribute
I can contribute a fix for this bug independently
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]