kevinjqliu commented on PR #1534: URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2600991160
Thanks @mattmartin14 for the PR! And thanks @bitsondatadev for the tips on working in OSS. I certainly had to learn a lot of these over the years.

A couple of things I think we can address first.

1. Support for MERGE INTO / Upsert

This has been a much-anticipated and frequently requested feature in the community. Issue #402 has been tracking it with many eyes on it. I think we still need to figure out the best approach to support this feature.

Like you mentioned in the description, `MERGE INTO` is a query engine feature. PyIceberg itself is a client library to support the Iceberg Python ecosystem. It aims to provide the necessary Iceberg building blocks so that other engines/programs can interact with Iceberg tables easily. As we build out more and more engine-like features, it becomes harder to support more complex and data-intensive workloads such as MERGE INTO. We have been able to use PyArrow for query processing, but it has its own limitations. For more compute-intensive workloads, such as the Bucket and Truncate transforms, we were able to leverage Rust (iceberg-rust) to handle the computation.

Looking at #402, I don’t see any concrete plans on how we can support MERGE INTO. I’ve added this as an agenda item on the [monthly pyiceberg sync](https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?tab=t.0#heading=h.rxx2wa3o215y) and will post the update. Please join us if you have time!

2. Taking on DataFusion as a dependency

I’m very interested in exploring DataFusion and ways we can leverage it for this project. As I mentioned above, we currently use PyArrow to handle most of the compute. It’ll be interesting to evaluate DataFusion as an alternative. DataFusion has its own ecosystem of expression API, dataframe API, and runtime, all of which complement PyIceberg well.
DataFusion also has integrations with the Rust side, something I have started exploring in https://github.com/apache/iceberg-rust/issues/865

That said, I think we need a wider discussion and alignment on how to integrate with DataFusion. It’s a good time to start thinking about it! I’ve added this as another discussion item on the monthly sync.

3. Performance concerns

Compute-intensive workloads are generally a bottleneck in Python. I am excited for future PyIceberg <> iceberg-rust integration where we can leverage Rust to perform those computations.

> The composite key code builds an overwrite filter, and once that filter gets too lengthy (in my testing more than 200 rows), the visitor “OR” function in pyiceberg hits a recursion depth error.

This is an interesting observation, and I think I’ve seen someone else run into this issue before. We’d want to address this separately. We might want to explore using DataFusion’s expression API to replace our own parser here.
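To illustrate why a long overwrite filter might hit a recursion limit, here is a minimal pure-Python sketch. It does not use PyIceberg's actual expression classes or visitor; `EqualTo`, `Or`, `matches`, `depth`, `chained_or`, and `balanced_or` are all hypothetical stand-ins. The assumption is that folding N predicates together one at a time produces a left-deep tree whose depth grows linearly with N, so a recursive visitor needs a stack frame (or several) per predicate, while a balanced tree over the same predicates only needs logarithmic depth.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class EqualTo:
    """Hypothetical leaf predicate: column `term` equals `value`."""
    term: str
    value: int


@dataclass(frozen=True)
class Or:
    """Hypothetical binary OR node."""
    left: object
    right: object


def matches(expr, row):
    """Recursive visitor: one Python stack frame per level of nesting."""
    if isinstance(expr, Or):
        return matches(expr.left, row) or matches(expr.right, row)
    return row.get(expr.term) == expr.value


def depth(expr):
    """Measure tree depth with an explicit stack, so it never recurses."""
    deepest = 0
    stack = [(expr, 1)]
    while stack:
        node, level = stack.pop()
        deepest = max(deepest, level)
        if isinstance(node, Or):
            stack.append((node.left, level + 1))
            stack.append((node.right, level + 1))
    return deepest


def chained_or(preds):
    """Fold predicates left to right: Or(Or(Or(p0, p1), p2), p3) ..."""
    return reduce(Or, preds)


def balanced_or(preds):
    """Split the list in half at each level, giving logarithmic depth."""
    if len(preds) == 1:
        return preds[0]
    mid = len(preds) // 2
    return Or(balanced_or(preds[:mid]), balanced_or(preds[mid:]))
```

With 1000 predicates, `depth(chained_or(preds))` is 1000 while `depth(balanced_or(preds))` is 11, so a recursive visitor such as `matches` stays far from Python's default recursion limit on the balanced form. Whether this is the right fix for PyIceberg itself is a separate question; it only shows the shape of the problem.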