kevinjqliu commented on PR #1534:
URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2600991160

   Thanks @mattmartin14 for the PR! And thanks @bitsondatadev for the tips on 
working in OSS. I certainly had to learn a lot of these over the years. 
   
   A couple things I think we can address first. 
   
   1. Support for MERGE INTO / Upsert
   
   This has been a much-anticipated and frequently requested feature in the 
community. Issue #402 has been tracking it with many eyes on it. I think we 
still need to figure out the best approach to support this feature. 
   
   Like you mentioned in the description, `MERGE INTO` is a query engine 
feature. Pyiceberg itself is a client library to support the Iceberg python 
ecosystem. Pyiceberg aims to provide the necessary Iceberg building blocks so 
that other engines/programs can interact with Iceberg tables easily. 
   
   As we’re building out more and more engine-like features, it becomes harder 
to support more complex and data-intensive workloads such as MERGE INTO. We 
have been able to use pyarrow for query processing but it has its own 
limitations. For more compute-intensive workloads, such as the Bucket and 
Truncate transforms, we were able to leverage rust (iceberg-rust) to handle 
the computation.
   
   Looking at #402, I don’t see any concrete plans on how we can support MERGE 
INTO. I’ve added this as an agenda item for the [monthly pyiceberg 
sync](https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?tab=t.0#heading=h.rxx2wa3o215y)
 and will post an update afterwards. Please join us if you have time! 
   
   2. Taking on Datafusion as a dependency
   
   I’m very interested in exploring datafusion and ways we can leverage it for 
this project. As I mentioned above, we currently use pyarrow to handle most of 
the compute. It’ll be interesting to evaluate datafusion as an alternative. 
Datafusion has its own ecosystem of expression api, dataframe api, and 
runtime, all of which are good complements to pyiceberg. It has integrations 
with the rust side as well, something I have started exploring in 
https://github.com/apache/iceberg-rust/issues/865
   
   That said, I think we need a wider discussion and alignment on how to 
integrate with datafusion. It’s a good time to start thinking about it! I’ve 
added this as another discussion item on the monthly sync. 
   
   3. Performance concerns
   
   Compute intensive workloads are generally a bottleneck in python. I am 
excited for future pyiceberg <> iceberg-rust integration where we can leverage 
rust to perform those computations. 
   
   > The composite key code builds an overwrite filter, and once that filter 
gets too lengthy (in my testing more than 200 rows), the visitor “OR” function 
in pyiceberg hits a recursion depth error.
   
   This is an interesting observation and I think I’ve seen someone else run 
into this issue before. We’d want to address this separately. One option we 
might want to explore is using datafusion’s expression api to replace our own 
parser.
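   
   To illustrate why a long "OR" filter can run into the recursion limit, here 
is a rough sketch of the general idea, not pyiceberg's actual code. `Eq`, `Or`, 
`or_chain`, `or_balanced` and `depth` are hypothetical stand-ins I made up for 
this example. Folding N predicates left to right gives a tree that is N levels 
deep, so any recursive visitor needs about N stack frames. Pairing them up as 
a balanced tree keeps the depth around log2(N). Whether this fits the existing 
visitor is an assumption that would need checking.

```python
from dataclasses import dataclass
from functools import reduce


# Hypothetical stand-ins for expression nodes; NOT pyiceberg classes.
@dataclass(frozen=True)
class Eq:
    column: str
    value: int


@dataclass(frozen=True)
class Or:
    left: object
    right: object


def depth(expr) -> int:
    """Depth of the expression tree, i.e. stack frames a recursive visitor needs."""
    if isinstance(expr, Or):
        return 1 + max(depth(expr.left), depth(expr.right))
    return 1


def or_chain(preds):
    """Left fold: Or(Or(Or(p0, p1), p2), p3) ... depth grows linearly."""
    return reduce(Or, preds)


def or_balanced(preds):
    """Split in half recursively so the depth grows logarithmically."""
    if len(preds) == 1:
        return preds[0]
    mid = len(preds) // 2
    return Or(or_balanced(preds[:mid]), or_balanced(preds[mid:]))


# One equality predicate per key, similar in spirit to an overwrite filter.
preds = [Eq("id", i) for i in range(256)]
chain_depth = depth(or_chain(preds))
balanced_depth = depth(or_balanced(preds))
print(chain_depth, balanced_depth)
```

   With 256 predicates the chained form is 256 levels deep while the balanced 
form is 9, so the balanced shape leaves a lot more headroom before a recursive 
visitor hits Python's recursion limit.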
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

