wypoon opened a new pull request, #6026: URL: https://github.com/apache/iceberg/pull/6026
There is a bug in Parquet vectorized reads, reported in https://github.com/apache/iceberg/issues/5927. It occurs when reading a Parquet data file (using the `BatchDataReader`) that is larger than the split size and has deletes that need to be applied to it.

The cause is that `ColumnarBatchReader#setRowGroupInfo` is not called with the correct `rowPosition`, because in `ReadConf`, `generateOffsetToStartPos(Schema)` returns null due to an optimization. When this happens, the `startRowPositions` array is populated entirely with 0s, so `ColumnarBatchReader#setRowGroupInfo` is called with `rowPosition` 0 even when the actual `rowPosition` is that of the second or a subsequent row group. In `ColumnarBatchReader`, `setRowGroupInfo` initializes a `rowStartPosInBatch` field, which determines where in the `PositionDeleteIndex` to start applying deletes. When `rowStartPosInBatch` is initialized incorrectly, the indexes of the positional deletes are not aligned with the rows in the data file.

The fix is to ensure that, when there are deletes, the Schema includes the `_pos` metadata column. `ReadConf#generateOffsetToStartPos(Schema)` then generates the `Map` that is needed to compute the `startRowPositions`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
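
To illustrate the misalignment described in the pull request text, here is a minimal, self-contained sketch. It is not Iceberg code: the class `RowPositionSketch` and its methods are hypothetical, and the model is deliberately simplified to a file of two row groups with one positional delete. It only shows why a row group start position of 0 makes a reader look up the wrong entries in a set of deleted positions, and why cumulative start positions avoid that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model; none of these names are Iceberg APIs.
final class RowPositionSketch {

  // Start position of each row group within the file: the running sum of the
  // row counts of all preceding row groups.
  static long[] cumulativeStartPositions(long[] rowGroupRowCounts) {
    long[] starts = new long[rowGroupRowCounts.length];
    long next = 0;
    for (int i = 0; i < rowGroupRowCounts.length; i++) {
      starts[i] = next;
      next += rowGroupRowCounts[i];
    }
    return starts;
  }

  // Reads one row group and returns the true file positions of the rows that
  // survive the positional deletes. The reader computes a row's position as
  // startPositions[rowGroup] + offset, so a wrong start position checks the
  // wrong entries of deletedPositions.
  static List<Long> liveRows(
      int rowGroup, long[] rowCounts, long[] startPositions, Set<Long> deletedPositions) {
    long trueStart = cumulativeStartPositions(rowCounts)[rowGroup];
    List<Long> live = new ArrayList<>();
    for (long offset = 0; offset < rowCounts[rowGroup]; offset++) {
      long assumedPos = startPositions[rowGroup] + offset;
      if (!deletedPositions.contains(assumedPos)) {
        live.add(trueStart + offset);
      }
    }
    return live;
  }

  public static void main(String[] args) {
    long[] rowCounts = {3, 3};          // two row groups of three rows each
    Set<Long> deletes = Set.of(4L);     // file position 4 is deleted
    long[] zeros = new long[rowCounts.length];
    long[] correct = cumulativeStartPositions(rowCounts);

    // With all-zero starts, position 4 in the second row group is never matched.
    System.out.println("all-zero starts: " + liveRows(1, rowCounts, zeros, deletes));
    // With cumulative starts, position 4 is dropped as expected.
    System.out.println("correct starts:  " + liveRows(1, rowCounts, correct, deletes));
  }
}
```

In this toy model, reading the second row group with all-zero start positions returns rows 3, 4 and 5, so the deleted row leaks through, while the cumulative start positions return only rows 3 and 5.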