huaxingao opened a new pull request, #11390: URL: https://github.com/apache/iceberg/pull/11390
In Spark batch reading, Iceberg reads additional columns when there are delete files. For instance, if we have a table `test (int id, string data)` and a query `SELECT id FROM test`, the requested schema contains only the column `id`. However, to determine which rows are deleted (there is a `rowIdMapping` for this purpose), Iceberg appends `_pos` to the requested schema for position deletes, and appends the equality filter column for equality deletes (suppose the equality delete is on column `data`). As a result, Iceberg will have `ColumnarBatchReader`s for these extra columns.

In the case of position deletes, we don't actually need to read `_pos` to compute the `rowIdMapping`, so this PR excludes the `_pos` column when building the `ColumnarBatchReader`. For equality deletes, we do need to read the equality filter column to compute the `rowIdMapping`, but once we have the `rowIdMapping`, we should exclude the values of these extra columns from the `ColumnarBatch`. I will have a separate PR to fix equality deletes.

In summary:
```
SELECT id FROM test
```
For position deletes, the vectorized reader currently returns a `ColumnarBatch` that contains Arrow vectors for both `id` and `_pos`. This PR makes Iceberg skip reading the `_pos` column, so the returned `ColumnarBatch` contains an Arrow vector for `id` only.

For equality deletes (suppose the filter is on the `data` column), the vectorized reader currently returns a `ColumnarBatch` that contains Arrow vectors for both `id` and `data`. The goal is to return a `ColumnarBatch` that contains an Arrow vector for `id` only. I will have a separate PR for this.
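To illustrate why `_pos` does not have to be read for position deletes, here is a minimal sketch, not Iceberg's actual code. It assumes the reader already knows the file position of the first row in the batch, so each row's position is just that offset plus its index in the batch. The class `RowIdMappingSketch`, the method `liveRowIds`, and its parameters are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.Set;

public class RowIdMappingSketch {

  // Hypothetical helper: returns the in-batch indexes of rows that are not deleted.
  // batchStartPos is the assumed file position of the first row in the batch,
  // so no materialized _pos column is needed to look up deletes.
  static int[] liveRowIds(long batchStartPos, int numRows, Set<Long> deletedPositions) {
    int[] mapping = new int[numRows];
    int live = 0;
    for (int i = 0; i < numRows; i++) {
      // Row i sits at file position batchStartPos + i.
      if (!deletedPositions.contains(batchStartPos + i)) {
        mapping[live++] = i;
      }
    }
    // Trim the array to the number of surviving rows.
    return Arrays.copyOf(mapping, live);
  }

  public static void main(String[] args) {
    // Batch covers file positions 100..104; positions 101 and 103 are deleted.
    int[] mapping = liveRowIds(100L, 5, Set.of(101L, 103L));
    System.out.println(Arrays.toString(mapping)); // prints [0, 2, 4]
  }
}
```

With a mapping like this, the batch only needs the vector for `id`; the surviving rows are selected by index rather than by comparing against a `_pos` vector.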