Fokko opened a new issue, #2396: URL: https://github.com/apache/iceberg-python/issues/2396
### Feature Request / Improvement

Today we lean heavily on PyArrow to read the Parquet files, but this has some significant disadvantages:

- PyArrow does not treat field IDs as first-class citizens. Therefore we first have to fetch the physical schema (from the Parquet files) and [prune the schema](https://github.com/apache/iceberg-python/blob/3eecdadc000047ec30749fc5d6ce1f2f072a30b2/pyiceberg/io/pyarrow.py#L1516) based on field IDs.
- We have to post-process the buffers to apply schema evolution. For example, if a table has promoted an integer to a long, Iceberg does not rewrite the data files with the new column type. Instead, when we see an integer at read time, we [promote the buffer to a long](https://github.com/apache/iceberg-python/blob/3eecdadc000047ec30749fc5d6ce1f2f072a30b2/pyiceberg/io/pyarrow.py#L1554-L1560). Ideally we want to push this down to the reader right away.

If we could push this down into Iceberg-Rust and return references to Arrow buffers back to PyIceberg, that would be great. We can start simple by still applying the merge-on-read deletes in PyIceberg, and move that over to Iceberg-Rust step by step.

--
This is an automated message from the Apache Git Service.
