wypoon opened a new pull request, #6026:
URL: https://github.com/apache/iceberg/pull/6026

   There is a bug in Parquet vectorized reads reported in 
https://github.com/apache/iceberg/issues/5927.
   This bug happens when reading a Parquet data file (using the 
`BatchDataReader`) that is bigger than the split size, when there are deletes 
that need to be applied to the data file. The cause of the bug is that 
`ColumnarBatchReader#setRowGroupInfo` is not called with the correct 
`rowPosition`, because in `ReadConf`, 
`generateOffsetToStartPos(Schema)` returns null due to an optimization. When 
this happens, the `startRowPositions` array is populated entirely with 0s, and 
thus `ColumnarBatchReader#setRowGroupInfo` gets called with `rowPosition` 0 
even for the second and subsequent row groups, whose actual start positions 
are greater than 0. In `ColumnarBatchReader`, `setRowGroupInfo` initializes a 
`rowStartPosInBatch` field, which is used to determine where in the 
`PositionDeleteIndex` to start applying deletes from. When 
`rowStartPosInBatch` is incorrectly initialized, the indexes of positional 
deletes are not correctly aligned with the rows in the data file.
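
   To illustrate the misalignment, here is a minimal, self-contained sketch. 
It is not Iceberg code: `PositionDeleteAlignment` and `keptRows` are 
hypothetical names, and a plain `Set<Long>` stands in for the 
`PositionDeleteIndex`. It assumes a file with two row groups of 4 rows each 
and positional deletes at file positions 2 and 5, and reads the second row 
group once with the correct start position (4) and once with the incorrect 
one (0).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PositionDeleteAlignment {
  // Returns the row indexes (relative to the row group) that survive the deletes.
  // rowStartPos plays the role of rowStartPosInBatch: the file position of the
  // first row in the row group being read.
  static List<Integer> keptRows(long rowStartPos, int rowCount, Set<Long> deletedPositions) {
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < rowCount; i++) {
      long filePosition = rowStartPos + i;
      if (!deletedPositions.contains(filePosition)) {
        kept.add(i);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Assumed example data: deletes at file positions 2 and 5.
    Set<Long> deletes = Set.of(2L, 5L);

    // Second row group covers file positions 4..7.
    // Correct start position: position 5 (local row 1) is dropped.
    System.out.println(keptRows(4L, 4, deletes)); // [0, 2, 3]

    // Start position wrongly reported as 0: the reader treats the rows as
    // positions 0..3, so it drops local row 2 and lets the deleted row through.
    System.out.println(keptRows(0L, 4, deletes)); // [0, 1, 3]
  }
}
```

   With the wrong start position, the row that should be deleted is returned 
and a live row is dropped instead.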
   The fix is to ensure that when there are deletes, the Schema has the `_pos` 
metadata column in it. Then `ReadConf#generateOffsetToStartPos(Schema)` will 
generate the necessary `Map` that is used to compute the `startRowPositions`.
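
   The effect of the fix can be sketched with a simplified model. This is not 
the actual `ReadConf` code: the class `StartRowPositions`, the helper 
`projectionFor`, and the use of a list of column names in place of an Iceberg 
`Schema` are all assumptions made for illustration, as are the row group 
offsets and row counts. The sketch assumes the optimization skips building 
the map whenever `_pos` is not in the projection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StartRowPositions {
  static final String ROW_POSITION = "_pos";

  // Hypothetical stand-in for the fix: add _pos whenever deletes must be applied.
  static List<String> projectionFor(List<String> requested, boolean hasDeletes) {
    List<String> columns = new ArrayList<>(requested);
    if (hasDeletes && !columns.contains(ROW_POSITION)) {
      columns.add(ROW_POSITION);
    }
    return columns;
  }

  // Simplified model of generateOffsetToStartPos: returns null when _pos is not
  // projected (the assumed optimization), otherwise maps each row group's file
  // offset to the position of its first row.
  static Map<Long, Long> offsetToStartPos(List<String> columns, long[] offsets, long[] rowCounts) {
    if (!columns.contains(ROW_POSITION)) {
      return null;
    }
    Map<Long, Long> result = new HashMap<>();
    long rowsSoFar = 0;
    for (int i = 0; i < offsets.length; i++) {
      result.put(offsets[i], rowsSoFar);
      rowsSoFar += rowCounts[i];
    }
    return result;
  }

  // With no map available, every row group is reported as starting at row 0.
  static long[] startRowPositions(Map<Long, Long> startPosByOffset, long[] offsets) {
    long[] starts = new long[offsets.length];
    for (int i = 0; i < offsets.length; i++) {
      starts[i] = startPosByOffset == null ? 0L : startPosByOffset.get(offsets[i]);
    }
    return starts;
  }

  public static void main(String[] args) {
    long[] offsets = {4L, 1000L};   // assumed row group offsets in the file
    long[] rowCounts = {100L, 50L}; // assumed rows per row group
    List<String> requested = List.of("id", "data");

    // Before the fix: _pos is not projected, so all start positions are 0.
    Map<Long, Long> before = offsetToStartPos(requested, offsets, rowCounts);
    System.out.println(Arrays.toString(startRowPositions(before, offsets))); // [0, 0]

    // After the fix: deletes force _pos into the projection.
    Map<Long, Long> after =
        offsetToStartPos(projectionFor(requested, true), offsets, rowCounts);
    System.out.println(Arrays.toString(startRowPositions(after, offsets))); // [0, 100]
  }
}
```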


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

