Fokko opened a new issue, #1679:
URL: https://github.com/apache/iceberg-python/issues/1679

   ### Feature Request / Improvement
   
   Right now we iterate [over all the rows](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/upsert_util.py)
   in Python to find the rows that differ between the source and the target.
   Only these rows are updated, which is great. However, Arrow is probably
   faster since it pushes the comparison down to C++. Therefore I came up with
   the following:
   
   ```python
   import functools
   import operator

   import pyarrow as pa
   import pyarrow.compute as pc


   def get_rows_to_update(source_table: pa.Table, target_table: pa.Table, join_cols: list[str]) -> pa.Table:
       """
       Return a table with rows that need to be updated in the target table based on the join columns.

       When a row is matched, an additional scan is done to evaluate the non-key columns
       to detect if an actual change has occurred.
       Only matched rows that have an actual change to a non-key column value
       will be returned in the final output.
       """
       # Keep the source column order so the expression is built deterministically.
       non_key_cols = [col for col in source_table.column_names if col not in join_cols]

       # With no non-key columns nothing can differ, and reduce() would fail on an empty list.
       if not non_key_cols:
           return source_table.schema.empty_table()

       # True when any non-key column differs between source (-lhs) and target (-rhs).
       diff_expr = functools.reduce(
           operator.or_,
           [pc.field(f"{col}-lhs") != pc.field(f"{col}-rhs") for col in non_key_cols],
       )

       return (
           source_table
           .join(target_table, keys=list(join_cols), join_type="inner", left_suffix="-lhs", right_suffix="-rhs")
           .filter(diff_expr)
           .drop_columns([f"{col}-rhs" for col in non_key_cols])
           .rename_columns({f"{col}-lhs" if col not in join_cols else col: col for col in source_table.column_names})
       )
   ```
   
   Unfortunately, it looks like Arrow's join doesn't carry through the
   nullability of the columns, so the result has a schema that is incompatible
   with the source table's schema. I raised an issue on the Arrow side:
   https://github.com/apache/arrow/issues/45557



