syun64 commented on PR #786: URL: https://github.com/apache/iceberg-python/pull/786#issuecomment-2145268428
Thank you for sharing the additional context, @djouallah. That's interesting! However, we don't use a PyArrow Dataset in PyIceberg yet. Instead, we use a PyArrow dataset Fragment to create a dataset Scanner for each data file and then build a PyArrow Table directly, rather than pushing that layer down into Arrow and using a single PyArrow Dataset that represents the entire table across its many files. I believe this is because of several outstanding issues that are tracked together here: https://github.com/apache/iceberg-python/issues/30

My understanding of PyCapsuleInterface is that it is an interface for passing data between the Python and C layers, which would matter if we wanted to write our own C-level implementation for handling the Arrow data format. That is potentially an option, but I think the long-term roadmap for making PyIceberg more efficient is to invest in Rust Python bindings, as @Fokko mentioned at the Iceberg Summit.

In the interim, this PR only aims to add an extra API that returns an Iterator[RecordBatch] or a RecordBatchReader instead of a Table, so that we don't have to fully materialize the data in memory when scanning.

Please let me know what you think about the points above. I want to make sure I'm understanding your suggestion correctly!