mattmartin14 opened a new pull request, #1534: URL: https://github.com/apache/iceberg-python/pull/1534
Hi, This is my first PR to the pyiceberg project, so I hope it is well received and I’m open to any feedback. This feature has been long asked by the pyiceberg community to be able to perform an “upsert” on an iceberg table a.k.a. MERGE command. There is a lot to unpack on this PR, so I’ll start with some high level items and go into more detail. For basic terminology, an upsert/merge is where you take a new dataset and merge it to the target Iceberg table by doing an update on the target table for rows that have changed and an insert for new rows. This is all performed atomically as 1 transaction, meaning both the update and the insert succeed together, or they fail; this ensures the table is not left in an unknown state if one of the actions were able to succeed but the other errored out. Over the last decade, the ANSI SQL MERGE command has evolved a lot to handle more than just update existing rows, insert new rows. This PR aims to just cover the BASIC upsert pattern; it has been architected to allow more of the ANSI MERGE command features over time (via the merge_options paramater) to be included if others would like to contribute, such as “WHEN NOT MATCHED BY SOURCE THEN DELETE”. In order to efficiently process an upsert on the Iceberg table, it requires an engine to detect what rows have changed and what rows are new. The engine I have chosen for this new feature is datafusion, which is a rust based high performance pyarrow data processing engine; I realize this introduces a new dependency for the pyiceberg project, which i've flagged as optional in the pyproject.toml file. My rational on choosing datafusion is this is also the same library that is being used for the iceberg-rust project. Thus, down the road, remediation of this merge command to the more modern iceberg-rust implementation should be minimal. As far as test coverage is concerned, my unit tests cover performing upserts on both single key and composite key Iceberg tables (meaning the table has more than 1 key field it needs to join on). For the performance testing, single key tables scale just fine. My unit tests does a 10k update/insert for one of the series of tets, but I have on my local workstation (Mac M2 pro) ran a test of 500k rows of updates/inserts on a single key join and it scales just fine. Where I am hitting a wall on performance and am needing other’s help here is when you want to use a composite key. The composite key code builds an overwrite filter, and once that filter gets too lengthy (in my testing more than 200 rows), the visitor “OR” function in pyiceberg hits a recursion depth error. I took a hard look at the visitor code in the expressions folder and I was hesitant to try and change any of those functions due to me not having a clear understanding of how they work, and this is where I need other’s help. If you want to see this scaling cocern, simply update the paramater on my test_merge_scenario_3_composite_key def to generate a target table with 1000 rows and a source table with 2000 rows and you will see the error surface. I don't think its smart nor pratical to try and change the python recursion depth default limit because we will still hit the wall at some point unless the visitor "OR" function gets reworked to avoid recursion. Some other ideas i've kicked around to try and mitigate this, is to have the merge_rows code temporarily modify the source dataframe and the target iceberg table to build a concatenated single key of all the composite keys to join on e.g. "col1_col2_col[n]...". But that would require modifying the entire iceberg target table with a full overwrite, which then defeats the purpose of an upsert to be able to run incremental updates on a table not have to overwrite the entire table. I think this PR is a good first step to finally realizing MERGE/upsert support for pyiceberg and gets a piercing in the armor. Again, I’m welcome to other’s feedback on this topic and look forward to partnering with you all on this journey. Thanks, Matt Martin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org