Re: [PR] Support `Table.to_arrow_batches` to return Iterator[Recordbatch] instead of a fully materialized Arrow Table [iceberg-python]

via GitHub Mon, 03 Jun 2024 06:55:01 -0700


syun64 commented on PR #786:
URL: https://github.com/apache/iceberg-python/pull/786#issuecomment-2145268428


   Thank you for sharing the additional context, @djouallah.
   
   That's interesting! We don't use a PyArrow Dataset in PyIceberg yet, 
though. We use a PyArrow dataset Fragment to create a dataset Scanner for each 
data file and then build a PyArrow Table directly, instead of pushing that 
layer down into Arrow and using a single PyArrow Dataset that represents the 
entire table across its many files. I believe this is due to several 
outstanding issues that are being tracked together here: 
https://github.com/apache/iceberg-python/issues/30
   
   My understanding of the PyCapsule Interface is that it is an interface for 
passing data back and forth between Python and the C layer, which would matter 
if we wanted to write our own handling of the Arrow data format in the C 
layer. This is potentially an option, but I think the long-term roadmap for 
doing something more efficient in PyIceberg is to invest in Rust Python 
bindings, as @Fokko mentioned at the Iceberg Summit.
   
   In the interim, this PR aims to just add an extra API that returns an 
Iterator[RecordBatch] or a RecordBatchReader instead of a Table so that we 
don't have to fully materialize the data into memory when we are scanning for 
data.
   
   Please let me know what you think about the points above. I want to make 
sure I'm understanding your suggestion correctly!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

