mwa28 opened a new issue, #2138:
URL: https://github.com/apache/iceberg-python/issues/2138
### Apache Iceberg version
0.9.0
### Please describe the bug 🐞
Hello,
I am trying out the new upsert method on a table created as follows in the AWS Glue Catalog:
```python
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.table.sorting import SortField, SortOrder
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import NestedField, StringType

catalog = load_catalog("glue", **{"type": "glue"})
schema = Schema(
    NestedField(1, "dt_insert", StringType(), required=True),
    NestedField(2, "controller_id", StringType(), required=True),
    NestedField(3, "timestamp", StringType(), required=True),
    NestedField(4, "data_type", StringType(), required=True),
    NestedField(5, "parameter", StringType(), required=True),
    NestedField(6, "value", StringType(), required=True),
    NestedField(7, "data_source", StringType(), required=True),
    NestedField(8, "unique_id", StringType(), required=True),
    NestedField(9, "dt_import_utc", StringType(), required=True),
    identifier_field_ids=[8],  # 'unique_id' is the primary key
)
catalog.create_table(
    identifier="<database>.<table_name>",
    schema=schema,
    partition_spec=PartitionSpec(
        PartitionField(
            source_id=9,
            field_id=1000,
            transform=IdentityTransform(),
            name="dt_import_utc",
        ),
    ),
    sort_order=SortOrder(
        SortField(source_id=2),  # controller_id
        SortField(source_id=4),  # data_type
        SortField(source_id=3),  # timestamp
        SortField(source_id=1),  # dt_insert
    ),
    location="s3://bucket/prefix/catalog",
    properties={
        "write.format.default": "parquet",
        "write.target-file-size-bytes": "134217728",  # 128 MB
        "write.metadata.delete-after-commit.enabled": "true",
        "write.metadata.previous-versions-max": "5",
    },
)
```
`unique_id` is the concatenation of columns 1, 2, 3, 4, and 5, joined with underscores.
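Roughly, the key is built like this (a sketch, not the exact production code; the DataFrame name `df` is assumed):
```python
# Sketch: join the five key columns with "_" to form unique_id.
key_cols = ["dt_insert", "controller_id", "timestamp", "data_type", "parameter"]
df["unique_id"] = df[key_cols].agg("_".join, axis=1)
```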
The upsert payload is a simple pandas DataFrame dumped to a dict and converted to a PyArrow table (I know PyArrow accepts a pandas DataFrame directly, but an internal constraint requires writing the DataFrame in one stage and reading it in another, so I dump it as JSON for readability):
```python
import pyarrow as pa

iceberg_table = catalog.load_table(table_name)
table = pa.Table.from_pydict(
    to_upsert,
    schema=iceberg_table.schema().as_arrow(),
)
iceberg_table.upsert(df=table, join_cols=["unique_id"])
```
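For completeness, the staging round-trip that produces `to_upsert` is roughly the following (paths simplified for illustration):
```python
import json

# Stage 1: dump the DataFrame as JSON for readability.
with open("/tmp/stage/to_upsert.json", "w") as f:
    json.dump(df.to_dict(orient="list"), f)

# Stage 2: read the dict back; it feeds pa.Table.from_pydict above.
with open("/tmp/stage/to_upsert.json") as f:
    to_upsert = json.load(f)
```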
With AWS Athena, a MERGE INTO statement on the same table takes about 3 seconds to run and scans 1.11 MB of data:
```sql
MERGE INTO <database>.<table> target
USING <database>.<temp_table> source
ON (target."unique_id" = source."unique_id")
WHEN MATCHED THEN
    UPDATE SET
        "dt_insert" = source."dt_insert",
        "controller_id" = source."controller_id",
        "timestamp" = source."timestamp",
        "data_type" = source."data_type",
        "parameter" = source."parameter",
        "value" = source."value",
        "data_source" = source."data_source",
        "dt_import_utc" = source."dt_import_utc",
        "unique_id" = source."unique_id"
WHEN NOT MATCHED THEN
    INSERT ("dt_insert", "controller_id", "timestamp", "data_type", "parameter",
            "value", "data_source", "dt_import_utc", "unique_id")
    VALUES (source."dt_insert", source."controller_id", source."timestamp",
            source."data_type", source."parameter", source."value",
            source."data_source", source."dt_import_utc", source."unique_id")
```
(statement generated by `awswrangler`'s `to_iceberg` function)
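Roughly, the call that generates it looks like this (paths simplified; `merge_cols` assumes a recent awswrangler 3.x release where it is available):
```python
import awswrangler as wr

# awswrangler builds the MERGE INTO statement when merge_cols is set.
wr.athena.to_iceberg(
    df=df,
    database="<database>",
    table="<table>",
    temp_path="s3://bucket/prefix/temp/",
    merge_cols=["unique_id"],
)
```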
Meanwhile, when I try the PyIceberg upsert, it uses more than 10240 MB of memory. I am running on AWS Lambda, and this causes an out-of-memory error.
I have no issue with the `append()` function, which completes fairly quickly, but it seems the upsert needs further optimization to retrieve only the relevant data efficiently.
In both scenarios (the Athena statement and the PyIceberg upsert call), the table size is 18.5 GB.
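For reference, a delete-then-append sketch that sidesteps `upsert` (assuming the batch of keys is small enough to embed in a predicate, and that every matched row is fully replaced):
```python
from pyiceberg.expressions import In

# Delete the incoming keys, then append the new rows in one transaction.
incoming_ids = table["unique_id"].to_pylist()
with iceberg_table.transaction() as txn:
    txn.delete(delete_filter=In("unique_id", incoming_ids))
    txn.append(table)
```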
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [x] I cannot contribute a fix for this bug at this time