[PR] Exclude reading _pos column if it's not in the scan list [iceberg]

via GitHub Thu, 24 Oct 2024 16:26:23 -0700


huaxingao opened a new pull request, #11390:
URL: https://github.com/apache/iceberg/pull/11390


   In Spark batch reading, Iceberg reads additional columns when there are 
delete files. For instance, given a table
   `test (int id, string data)` and a query `SELECT id FROM test`, the 
requested schema contains only the column `id`. However, to determine which 
rows are deleted (there is a `rowIdMapping` for this purpose), Iceberg appends 
`_pos` to the requested schema for position deletes, and appends the equality 
filter column for equality deletes (suppose the equality delete is on column 
`data`). As a result, Iceberg creates `ColumnarBatchReader`s for these extra 
columns. For position deletes, we don't actually need to read `_pos` to 
compute the `rowIdMapping`, so this PR excludes the `_pos` column when 
building the `ColumnarBatchReader`. For equality deletes, we do need to read 
the equality filter column to compute the `rowIdMapping`, but once we have the 
`rowIdMapping`, we should exclude the values of these extra columns from the 
`ColumnarBatch`. I will open a separate PR to fix the equality delete case.
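
To illustrate why `_pos` does not have to be materialized for position deletes, here is a minimal Python sketch. It is not Iceberg's code. It assumes that the reader knows the file position of the first row in each batch and that the position deletes are available as a set of deleted file positions; `build_row_id_mapping`, `row_start_pos`, and `deleted_positions` are hypothetical names used only for this illustration.

```python
from typing import List, Set, Tuple


def build_row_id_mapping(
    row_start_pos: int, num_rows: int, deleted_positions: Set[int]
) -> Tuple[List[int], int]:
    """Return (mapping, live_row_count) for one batch.

    mapping[k] is the index, within the batch, of the k-th row that survives
    the position deletes. A row's file position is derived as
    row_start_pos + i, so no `_pos` vector has to be read from the file.
    """
    mapping = [
        i for i in range(num_rows) if (row_start_pos + i) not in deleted_positions
    ]
    return mapping, len(mapping)


# A batch of 5 rows starting at file position 100, with positions 101 and 103
# deleted (position 200 lies outside this batch and is ignored).
mapping, live = build_row_id_mapping(100, 5, {101, 103, 200})
print(mapping, live)  # [0, 2, 4] 3
```

Under these assumptions the mapping depends only on the batch offset and the delete set, which is why a reader for `_pos` is not required unless the query itself selects `_pos`.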
   
   In summary:
   ```
   SELECT id FROM test
   ```
   For position deletes, the vectorized reader currently returns a 
`ColumnarBatch` that contains Arrow vectors for both `id` and `_pos`. This PR 
makes Iceberg skip reading the `_pos` column, so the returned `ColumnarBatch` 
contains an Arrow vector for `id` only.
   
   For equality deletes (suppose the filter is on the `data` column), the 
vectorized reader currently returns a `ColumnarBatch` that contains Arrow 
vectors for both `id` and `data`. The goal is to return a `ColumnarBatch` that 
contains an Arrow vector for `id` only. I will open a separate PR for this.
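
The intended behavior for equality deletes can be sketched the same way. This is a rough Python illustration, not the planned implementation: a batch is modeled as a dict of column name to list of values, the surviving rows are copied out eagerly for simplicity, and `project_after_equality_deletes` and its parameters are hypothetical names.

```python
from typing import Any, Dict, List, Set


def project_after_equality_deletes(
    batch: Dict[str, List[Any]],
    requested: List[str],
    delete_column: str,
    deleted_values: Set[Any],
) -> Dict[str, List[Any]]:
    """Apply equality deletes, then keep only the requested columns.

    The delete column must be read to decide which rows survive, but it is
    dropped from the result when the query did not ask for it.
    """
    values = batch[delete_column]
    # Row-id mapping: indices of rows whose delete-column value is not deleted.
    mapping = [i for i, v in enumerate(values) if v not in deleted_values]
    return {name: [batch[name][i] for i in mapping] for name in requested}


# SELECT id FROM test, with an equality delete on data = 'b'.
batch = {"id": [1, 2, 3, 4], "data": ["a", "b", "c", "b"]}
print(project_after_equality_deletes(batch, ["id"], "data", {"b"}))  # {'id': [1, 3]}
```

The result contains only `id`, even though `data` had to be read to evaluate the delete.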

