Re: [I] Upsertion memory usage grows exponentially as table grows [iceberg-python]

via GitHub Mon, 23 Jun 2025 03:17:47 -0700


mwa28 commented on issue #2138:
URL: 
https://github.com/apache/iceberg-python/issues/2138#issuecomment-2995825344


   From what I was able to gather, 
[`upsert()`](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L697)
 could benefit from partition awareness, so that it performs a partial scan 
instead of a full table scan.
   
   An AI-generated code fix would look as follows:
   
   ```python
   import functools
   import operator

   import pyarrow as pa
   import pyarrow.compute as pc


   def upsert(
       self,
       df: pa.Table,
       join_cols: Optional[List[str]] = None,
       when_matched_update_all: bool = True,
       when_not_matched_insert_all: bool = True,
       case_sensitive: bool = True,
   ) -> UpsertResult:
       # Get partition columns from the table spec.
       # NOTE: these are partition field names; they match the source column
       # names only for identity transforms.
       partition_cols = [field.name for field in self.spec().fields]

       # Partition pruning: filter matched_iceberg_table to only relevant partitions.
       # matched_iceberg_table is assumed to come from the elided upsert logic.
       if partition_cols:
           partition_values = df.select(partition_cols).to_pandas().drop_duplicates()
           mask = None
           for _, row in partition_values.iterrows():
               # AND together the equality conditions for one partition tuple
               cond = functools.reduce(
                   operator.and_,
                   [pc.field(col) == row[col] for col in partition_cols]
               )
               # OR the per-partition conditions into a single mask
               mask = cond if mask is None else (mask | cond)
           if mask is not None and len(matched_iceberg_table) > 0:
               matched_iceberg_table = matched_iceberg_table.filter(mask)

       # ...rest of upsert logic...
   ```
   
   I have not reviewed the AI-generated code, and it could be complete 
gibberish, but I do agree that the function lacks partition awareness. I 
assumed this would be built into an Iceberg table, but that does not seem to 
be the case at the moment.
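
To illustrate the idea independently of pyiceberg, here is a minimal pure-Python sketch of partition pruning before a match. It is not how `upsert()` is implemented: the names `partition_keys` and `prune_to_partitions` and the sample rows are hypothetical, rows are plain dicts, and the partition columns are assumed to be identity-partitioned.

```python
from typing import Any, Dict, List, Set, Tuple

Row = Dict[str, Any]


def partition_keys(rows: List[Row], partition_cols: List[str]) -> Set[Tuple[Any, ...]]:
    """Collect the distinct partition tuples present in the given rows."""
    return {tuple(row[col] for col in partition_cols) for row in rows}


def prune_to_partitions(
    existing: List[Row], incoming: List[Row], partition_cols: List[str]
) -> List[Row]:
    """Keep only existing rows whose partition tuple also occurs in the incoming rows."""
    if not partition_cols:
        # Unpartitioned table: nothing can be pruned, so every row is a candidate.
        return list(existing)
    wanted = partition_keys(incoming, partition_cols)
    return [
        row
        for row in existing
        if tuple(row[col] for col in partition_cols) in wanted
    ]


# Hypothetical data: the table is partitioned by "region".
existing = [
    {"id": 1, "region": "eu", "value": 10},
    {"id": 2, "region": "us", "value": 20},
    {"id": 3, "region": "ap", "value": 30},
]
incoming = [
    {"id": 1, "region": "eu", "value": 11},
    {"id": 4, "region": "eu", "value": 40},
]

# Only rows in the "eu" partition need to be compared against the incoming batch.
candidates = prune_to_partitions(existing, incoming, ["region"])
print(candidates)
```

In this sketch the incoming batch touches a single partition, so the join-column comparison only has to consider rows from that partition rather than the whole table.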


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
