kevinjqliu commented on PR #1878: URL: https://github.com/apache/iceberg-python/pull/1878#issuecomment-2822958271
I took some time to think about this issue. The main problem is that pyarrow's `join` does not support complex types (regardless of whether the complex type is part of the join keys).

Taking a step back, the [`get_rows_to_update`](https://github.com/apache/iceberg-python/blob/b85127e05b699f5e8f7acc1034b9f258fd209477/pyiceberg/table/__init__.py#L1204-L1207) function does two things:

1. It filters both the Iceberg table (target table) and the upsert dataframe (source table) on the join keys. Any matching rows become part of the potential "rows to update" table.
2. It does an extra optimization to avoid rewriting rows that are an _exact_ match. This is done by comparing the non-join-key columns; rows that match exactly are filtered out.

Both operations are combined by using the `join` function.

This PR currently skips step 2 as a fallback mechanism. All rows matching the join keys will be returned, regardless of whether they are an exact match. I think this is fine, since the overwrite will just rewrite more data than strictly necessary. We should update this comment though: https://github.com/apache/iceberg-python/blob/b85127e05b699f5e8f7acc1034b9f258fd209477/pyiceberg/table/__init__.py#L1204-L1207

I can't think of an efficient way to do step 2. I think it is possible to build a mask over all rows of the non-join-key columns and then filter the "rows to update" table... We are kind of already building the filter here: https://github.com/apache/iceberg-python/blob/b85127e05b699f5e8f7acc1034b9f258fd209477/pyiceberg/table/upsert_util.py#L74-L83
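To make the two steps concrete, here is a minimal plain-Python sketch of the logic. It is not the pyiceberg implementation and does not use pyarrow: rows are modeled as dicts, and the function name `rows_to_update` and the sample data are hypothetical, chosen only for illustration. It assumes the join-key values are hashable and unique in the target.

```python
from typing import Any

Row = dict[str, Any]


def rows_to_update(target: list[Row], source: list[Row], join_cols: list[str]) -> list[Row]:
    """Hypothetical helper: return source rows whose join key exists in
    target and whose non-key columns differ from the matching target row."""
    # Step 1: index target rows by join key (assumes hashable, unique keys).
    target_by_key = {tuple(row[c] for c in join_cols): row for row in target}

    result = []
    for row in source:
        key = tuple(row[c] for c in join_cols)
        match = target_by_key.get(key)
        if match is None:
            continue  # no matching key: not a candidate for update
        # Step 2: drop exact matches by comparing the non-join-key columns.
        # Python's != compares nested dicts and lists by value, so complex
        # (struct/list-like) values are handled here.
        non_key_cols = [c for c in row if c not in join_cols]
        if any(row[c] != match.get(c) for c in non_key_cols):
            result.append(row)
    return result


target = [
    {"id": 1, "meta": {"tags": ["a"]}},
    {"id": 2, "meta": {"tags": ["b"]}},
]
source = [
    {"id": 1, "meta": {"tags": ["a"]}},       # exact match: skipped by step 2
    {"id": 2, "meta": {"tags": ["b", "c"]}},  # key matches, value differs
    {"id": 3, "meta": {"tags": ["d"]}},       # key not in target
]

print(rows_to_update(target, source, ["id"]))
```

Skipping step 2, as the fallback does, would correspond to appending every row that passes the key lookup, so the row with `id` 1 would also be returned and rewritten unchanged. A row-by-row loop like this would be slow on large tables; it only illustrates the semantics a vectorized mask would need to reproduce.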