mattmartin14 opened a new pull request, #1534:
URL: https://github.com/apache/iceberg-python/pull/1534

   Hi,
   
   This is my first PR to the pyiceberg project, so I hope it is well received 
and I’m open to any feedback. This feature has been long asked by the pyiceberg 
community to be able to perform an “upsert” on an iceberg table a.k.a. MERGE 
command. There is a lot to unpack on this PR, so I’ll start with some high 
level items and go into more detail.
   
   For basic terminology, an upsert/merge is where you take a new dataset and 
merge it to the target Iceberg table by doing an update on the target table for 
rows that have changed and an insert for new rows. This is all performed 
atomically as 1 transaction, meaning both the update and the insert succeed 
together, or they fail; this ensures the table is not left in an unknown state 
if one of the actions were able to succeed but the other errored out.
   
   Over the last decade, the ANSI SQL MERGE command has evolved a lot to handle 
more than just update existing rows, insert new rows. This PR aims to just 
cover the BASIC upsert pattern; it has been architected to allow more of the 
ANSI MERGE command features over time (via the merge_options paramater) to be 
included if others would like to contribute, such as “WHEN NOT MATCHED BY 
SOURCE THEN DELETE”.
   
   In order to efficiently process an upsert on the Iceberg table, it requires 
an engine to detect what rows have changed and what rows are new. The engine I 
have chosen for this new feature is datafusion, which is a rust based high 
performance pyarrow data processing engine; I realize this introduces a new 
dependency for the pyiceberg project, which i've flagged as optional in the 
pyproject.toml file. My rational on choosing datafusion is this is also the 
same library that is being used for the iceberg-rust project. Thus, down the 
road, remediation of this merge command to the more modern iceberg-rust 
implementation should be minimal.
   
   As far as test coverage is concerned, my unit tests cover performing upserts 
on both single key and composite key Iceberg tables (meaning the table has more 
than 1 key field it needs to join on). For the performance testing, single key 
tables scale just fine. My unit tests does a 10k update/insert for one of the 
series of tets, but I have on my local workstation (Mac M2 pro) ran a test of 
500k rows of updates/inserts on a single key join and it scales just fine.
   
   Where I am hitting a wall on performance and am needing other’s help here is 
when you want to use a composite key. The composite key code builds an 
overwrite filter, and once that filter gets too lengthy (in my testing more 
than 200 rows), the visitor “OR” function in pyiceberg hits a recursion depth 
error. I took a hard look at the visitor code in the expressions folder and I 
was hesitant to try and change any of those functions due to me not having a 
clear understanding of how they work, and this is where I need other’s help. If 
you want to see this scaling cocern, simply update the paramater on my 
test_merge_scenario_3_composite_key def to generate a target table with 1000 
rows and a source table with 2000 rows and you will see the error surface. I 
don't think its smart nor pratical to try and change the python recursion depth 
default limit because we will still hit the wall at some point unless the 
visitor "OR" function gets reworked to avoid recursion. 
   
   Some other ideas i've kicked around to try and mitigate this, is to have the 
merge_rows code temporarily modify the source dataframe and the target iceberg 
table to build a concatenated single key of all the composite keys to join on 
e.g. "col1_col2_col[n]...". But that would require modifying the entire iceberg 
target table with a full overwrite, which then defeats the purpose of an upsert 
to be able to run incremental updates on a table not have to overwrite the 
entire table.
   
   I think this PR is a good first step to finally realizing MERGE/upsert 
support for pyiceberg and gets a piercing in the armor. Again, I’m welcome to 
other’s feedback on this topic and look forward to partnering with you all on 
this journey.
   
   Thanks,
   Matt Martin


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to