Re: [PR] Feature: MERGE/Upsert Support [iceberg-python]

via GitHub Sun, 19 Jan 2025 11:45:28 -0800


kevinjqliu commented on PR #1534:
URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2600991160

Thanks @mattmartin14 for the PR! And thanks @bitsondatadev for the tips on
working in OSS. I certainly had to learn a lot of these over the years.

A couple things I think we can address first.

1. Support for MERGE INTO / Upsert

This has been a much-anticipated and frequently requested feature in the
community. Issue #402 has been tracking it with many eyes on it. I think we
still need to figure out the best approach to support this feature.

Like you mentioned in the description, `MERGE INTO` is a query engine
feature. Pyiceberg itself is a client library to support the Iceberg python
ecosystem. Pyiceberg aims to provide the necessary Iceberg building blocks so
that other engines/programs can interact with Iceberg tables easily.
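For readers less familiar with the term, here is a minimal sketch of the upsert
semantics being discussed, written with plain Python dicts. The `upsert` helper
and its arguments are hypothetical names for illustration only; this is not
pyiceberg's API and not how the PR implements it.

```python
def upsert(target_rows, source_rows, key_cols):
    """Merge source rows into target rows by key (hypothetical helper).

    Rows whose key already exists in the target are replaced (update);
    rows with a new key are appended (insert).
    """
    def key_of(row):
        # A composite key is just the tuple of the key column values.
        return tuple(row[c] for c in key_cols)

    # dicts preserve insertion order, so existing rows keep their position.
    merged = {key_of(r): r for r in target_rows}
    updated = inserted = 0
    for row in source_rows:
        k = key_of(row)
        if k in merged:
            if merged[k] != row:
                updated += 1
        else:
            inserted += 1
        merged[k] = row
    return list(merged.values()), updated, inserted
```

A query engine does the same matching at scale with joins, which is why the
compute side matters so much for this feature.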

As we’re building out more and more engine-like features, it becomes harder
to support more complex and data-intensive workloads such as MERGE INTO. We
have been able to use pyarrow for query processing but it has its own
limitations. For more compute-intensive workloads, such as the Bucket and
Truncate transforms, we were able to leverage rust (iceberg-rust) to handle the
computation.

Looking at #402, I don’t see any concrete plans on how we can support MERGE
INTO. I’ve added this as an agenda item for the [monthly pyiceberg
sync](https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U/edit?tab=t.0#heading=h.rxx2wa3o215y)
and will post the update. Please join us if you have time!

2. Taking on Datafusion as a dependency

I’m very interested in exploring datafusion and ways we can leverage it for
this project. As I mentioned above, we currently use pyarrow to handle most of
the compute. It’ll be interesting to evaluate datafusion as an alternative.
Datafusion has its own ecosystem of expression api, dataframe api, and runtime,
all of which are good complements to pyiceberg. It has integrations with the
rust side as well, something I have started exploring in
https://github.com/apache/iceberg-rust/issues/865

That said, I think we need a wider discussion and alignment on how to
integrate with datafusion. It’s a good time to start thinking about it! I’ve
added this as another discussion item on the monthly sync.

3. Performance concerns

Compute-intensive workloads are generally a bottleneck in python. I am
excited for future pyiceberg <> iceberg-rust integration where we can leverage
rust to perform those computations.

> The composite key code builds an overwrite filter, and once that filter
> gets too lengthy (in my testing more than 200 rows), the visitor “OR” function
> in pyiceberg hits a recursion depth error.

This is an interesting observation and I think I’ve seen someone else run
into this issue before. We’d want to address this separately. One option we
might want to explore is using datafusion’s expression api to replace our own
parser.
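To illustrate why a long OR filter can hit the recursion limit, here is a
rough sketch using stand-in classes (`EqualTo`, `Or`, `depth`, `or_chain`, and
`or_balanced` are hypothetical names, not pyiceberg's expression classes or
visitor). It assumes the filter is built by chaining ORs one at a time and
walked recursively, which I have not verified against the PR. Building a
balanced tree instead is one possible mitigation, not a confirmed fix.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class EqualTo:
    column: str
    value: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


def depth(expr):
    """Recursive walk, visitor-style: one Python frame per nesting level."""
    if isinstance(expr, Or):
        return 1 + max(depth(expr.left), depth(expr.right))
    return 1


def or_chain(preds):
    """Left-deep chain: nesting depth grows linearly with len(preds)."""
    return reduce(Or, preds)


def or_balanced(preds):
    """Balanced tree: nesting depth grows with log2(len(preds))."""
    if len(preds) == 1:
        return preds[0]
    mid = len(preds) // 2
    return Or(or_balanced(preds[:mid]), or_balanced(preds[mid:]))
```

With 200 predicates the chained form nests 200 levels deep, while the balanced
form nests 9 levels deep, so a recursive walk stays far away from Python's
recursion limit.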

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
