Fokko opened a new issue, #2396: URL: https://github.com/apache/iceberg-python/issues/2396
### Feature Request / Improvement

Today we lean heavily on PyArrow to read the Parquet files, but this has some significant disadvantages:

- PyArrow does not treat field IDs as first-class citizens. Therefore we first have to fetch the physical schema (from the Parquet files) and [prune the schema](https://github.com/apache/iceberg-python/blob/3eecdadc000047ec30749fc5d6ce1f2f072a30b2/pyiceberg/io/pyarrow.py#L1516) based on field IDs.
- We have to post-process the buffers to apply schema evolution. For example, if a table has promoted an integer to a long, Iceberg does not rewrite the data files with the new column type. Instead, when we see an integer at read time, we [promote the buffer to a long](https://github.com/apache/iceberg-python/blob/3eecdadc000047ec30749fc5d6ce1f2f072a30b2/pyiceberg/io/pyarrow.py#L1554-L1560). Ideally we want to push this down to the reader right away.

If we could push this down into Iceberg-Rust and return references to Arrow buffers back to PyIceberg, that would be great. We can start simple by still applying the merge-on-read deletes in PyIceberg, and move that over to Iceberg-Rust step by step.

--
This is an automated message from the Apache Git Service.
