Fokko opened a new issue, #1679: URL: https://github.com/apache/iceberg-python/issues/1679
### Feature Request / Improvement

Right now we iterate [over all the rows](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/upsert_util.py) to find the rows that are different. Only these rows are updated, which is great. However, Arrow is probably faster since it pushes the comparison down to C++. Therefore I came up with the following:

```python
import functools
import operator

import pyarrow as pa
import pyarrow.compute as pc


def get_rows_to_update(source_table: pa.Table, target_table: pa.Table, join_cols: list[str]) -> pa.Table:
    """
    Return a table with rows that need to be updated in the target table based on the join columns.

    When a row is matched, an additional scan is done to evaluate the non-key columns to detect
    if an actual change has occurred. Only matched rows that have an actual change to a non-key
    column value will be returned in the final output.
    """
    all_columns = set(source_table.column_names)
    join_cols_set = set(join_cols)
    non_key_cols = all_columns - join_cols_set

    # A row has changed if any non-key column differs between source (-lhs) and target (-rhs).
    diff_expr = functools.reduce(
        operator.or_,
        [pc.field(f"{col}-lhs") != pc.field(f"{col}-rhs") for col in non_key_cols],
    )

    return (
        source_table
        # Match source rows to target rows on the key columns.
        .join(target_table, keys=list(join_cols_set), join_type='inner', left_suffix='-lhs', right_suffix='-rhs')
        # Keep only the matched rows where a non-key value differs.
        .filter(diff_expr)
        # Drop the target-side values and restore the original column names.
        .drop_columns([f"{col}-rhs" for col in non_key_cols])
        .rename_columns({f"{col}-lhs" if col not in join_cols else col: col for col in source_table.column_names})
    )
```

Unfortunately, it looks like Arrow doesn't carry the nullability of the columns through the join, resulting in an incompatible schema. I raised an issue on the Arrow side here: https://github.com/apache/arrow/issues/45557