mwa28 commented on issue #2138: URL: https://github.com/apache/iceberg-python/issues/2138#issuecomment-2995825344
From what I was able to gather, [`upsert()`](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L697) could benefit from partition awareness, so that it performs a partial scan instead of a full table scan. An AI-generated code fix would look as follows:

```python
import functools
import operator
from typing import List, Optional

import pyarrow as pa
import pyarrow.compute as pc


def upsert(
    self,
    df: pa.Table,
    join_cols: Optional[List[str]] = None,
    when_matched_update_all: bool = True,
    when_not_matched_insert_all: bool = True,
    case_sensitive: bool = True,
) -> UpsertResult:
    # Get partition columns from the table spec
    partition_cols = [field.name for field in self.spec().fields]

    # Partition pruning: filter matched_iceberg_table to only relevant partitions.
    # NOTE: matched_iceberg_table is assumed to be defined by the elided upsert logic.
    if partition_cols:
        # Distinct partition-value combinations present in the incoming data
        partition_values = df.select(partition_cols).to_pandas().drop_duplicates()
        mask = None
        for _, row in partition_values.iterrows():
            # AND the per-column equality checks for one partition ...
            cond = functools.reduce(
                operator.and_,
                [pc.field(col) == row[col] for col in partition_cols],
            )
            # ... and OR the partitions together
            mask = cond if mask is None else (mask | cond)
        if mask is not None and len(matched_iceberg_table) > 0:
            matched_iceberg_table = matched_iceberg_table.filter(mask)

    # ...rest of upsert logic...
```

I have not reviewed the AI-generated code, and it could be complete gibberish, but I do agree that the function lacks partition awareness. I thought that would have been baked into an Iceberg table, but it doesn't seem to be at the moment.
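To make the pruning idea concrete without depending on PyIceberg internals, here is a minimal, library-free sketch. It assumes identity-partitioned data, and every name in it (`distinct_partition_keys`, `prune_files`, the dict standing in for the table's file listing, the file names) is hypothetical and not a PyIceberg API.

```python
from typing import Iterable, List, Mapping, Sequence, Set, Tuple

Row = Mapping[str, object]
PartitionKey = Tuple[object, ...]


def distinct_partition_keys(
    rows: Iterable[Row], partition_cols: Sequence[str]
) -> Set[PartitionKey]:
    """Collect the distinct partition-value tuples present in the incoming rows."""
    return {tuple(row[col] for col in partition_cols) for row in rows}


def prune_files(
    files_by_partition: Mapping[PartitionKey, List[str]], keys: Set[PartitionKey]
) -> List[str]:
    """Keep only the files whose partition tuple appears in the incoming data."""
    selected: List[str] = []
    for key, files in files_by_partition.items():
        if key in keys:
            selected.extend(files)
    return sorted(selected)


# Hypothetical incoming batch, partitioned by (region, day).
incoming = [
    {"id": 1, "region": "eu", "day": "2025-06-01"},
    {"id": 2, "region": "eu", "day": "2025-06-01"},
    {"id": 3, "region": "us", "day": "2025-06-02"},
]

# Hypothetical stand-in for the table's data files, keyed by partition tuple.
files_by_partition = {
    ("eu", "2025-06-01"): ["eu-0601-a.parquet", "eu-0601-b.parquet"],
    ("eu", "2025-06-02"): ["eu-0602-a.parquet"],
    ("us", "2025-06-02"): ["us-0602-a.parquet"],
}

keys = distinct_partition_keys(incoming, ["region", "day"])
to_scan = prune_files(files_by_partition, keys)
print(to_scan)
```

Here the `("eu", "2025-06-02")` partition is never touched because no incoming row belongs to it. For pruning to save I/O it has to happen before the read; if `matched_iceberg_table` is already materialised when the mask is applied, filtering it reduces the rows compared but not the data scanned.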