mattmartin14 commented on PR #1534: URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2608311564
Also @kevinjqliu - to address your question on datafusion: when I looked into this feature, I explored these 3 options for an Arrow processing engine:

1. Duckdb
2. Datafusion
3. Daft

I ultimately decided that datafusion would make the most sense, given these things it had going for it:

- It's already owned by the Apache Software Foundation, so licensing would be a non-issue.
- It's very lightweight and specifically designed to process and query Arrow tables.
- It's Rust-based, and if pyiceberg is ultimately going to be migrated to iceberg-rust one day, the integrations would be easier.
- The iceberg-rust project is already building integrations for it, as seen [here](https://github.com/apache/iceberg-rust/tree/main/crates/integrations/datafusion).

Hope this helps explain how I arrived at that conclusion. Just using native pyarrow to process the data would be a very large uphill battle, as we would effectively have to build our own data processing engine with it, e.g. hash joins, sorting, optimizations, etc. I figured it does not make sense to reinvent the wheel; better to take an engine that is already out there (datafusion) and put it to good use.

I took a look at the attachment you posted for any upcoming meetings for the pyiceberg sync, but did not see any 2025 meetings listed. I'd be glad to attend to discuss this further, if needed.

Thanks,
Matt

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org