mattmartin14 commented on PR #1534: URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2635079402
Alright @Fokko @tscottcoombes1, some good news. I just pushed an update that removes the datafusion dependency from the main pyiceberg merge_rows function. My test file still uses datafusion to generate the test datasets, which I'm told is OK.

The two functions I'd ask you to scrutinize carefully are in the merge_rows_util.py file:

- get_rows_to_update
- get_rows_to_insert

Given this is my first go at pyarrow filters, there are probably optimizations or functional changes that would make them better, and I'm open to suggestions. In summary, we have gotten rid of datafusion at this point, and even though we have to loop to compare rows and apply filters (I don't really know of another way), we are not using pyarrow joins.

Thanks for all the help getting here. I'm curious what updates in the merge_rows function will remain.

Thanks,
Matt

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org