kevinjqliu commented on PR #1878:
URL: https://github.com/apache/iceberg-python/pull/1878#issuecomment-2822958271

   So I took some time to think about this issue.
   
   The main issue here is that pyarrow's `join` does not support complex types 
(regardless of whether the complex type is part of the join keys).
   
   Taking a step back, the 
[`get_rows_to_update`](https://github.com/apache/iceberg-python/blob/b85127e05b699f5e8f7acc1034b9f258fd209477/pyiceberg/table/__init__.py#L1204-L1207)
 function does 2 things:
   1. It filters both the iceberg table (target table) and the upsert dataframe 
(source table) on the join keys. Any matching rows will be part of the 
potential "rows to update" table.
   2. It does an extra optimization to avoid rewriting rows that are an _exact_ 
match. This is done by comparing the non-join-key columns and filtering out any 
rows that match exactly.
   
   Both operations are combined by using the `join` function. 
   
   This PR currently skips step 2 as a fallback mechanism. All rows matching 
the join keys will be returned, regardless of whether it's an exact match. I 
think this is fine since the result is still correct; the overwrite will just 
rewrite more rows than strictly necessary.
   
   We should update this comment though:
   
https://github.com/apache/iceberg-python/blob/b85127e05b699f5e8f7acc1034b9f258fd209477/pyiceberg/table/__init__.py#L1204-L1207
   
   I can't think of an efficient way to do step 2. I think it is possible to 
build a mask over all rows of the non-join-key columns and then filter the 
"rows to update" table with it... 
   We are kind of already building the filter here:
   
https://github.com/apache/iceberg-python/blob/b85127e05b699f5e8f7acc1034b9f258fd209477/pyiceberg/table/upsert_util.py#L74-L83


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
