mwa28 opened a new issue, #2138: URL: https://github.com/apache/iceberg-python/issues/2138
### Apache Iceberg version 0.9.0 ### Please describe the bug 🐞 Hello, I am trying out the new upsert method on a table created as follows on the AWS Glue Catalog : ```python catalog = load_catalog("glue", **{"type": "glue"}) schema = Schema( NestedField(1, "dt_insert", StringType(), required=True), NestedField(2, "controller_id", StringType(), required=True), NestedField(3, "timestamp", StringType(), required=True), NestedField(4, "data_type", StringType(), required=True), NestedField(5, "parameter", StringType(), required=True), NestedField(6, "value", StringType(), required=True), NestedField(7, "data_source", StringType(), required=True), NestedField(8, "unique_id", StringType(), required=True), NestedField(9, "dt_import_utc", StringType(), required=True), identifier_field_ids=[8], # 'unique_id' is the primary key ) catalog.create_table( identifier=<database>.<table_name>, schema=schema, partition_spec=PartitionSpec( PartitionField( source_id=9, field_id=1000, transform=IdentityTransform(), name="dt_import_utc", ), ), sort_order=SortOrder( SortField(source_id=2), # controller_id SortField(source_id=4), # data_type SortField(source_id=3), # dt_insert SortField(source_id=1), # timestamp ), location="s3://bucket/prefix/catalog", properties={ "write.format.default" : "parquet", "write.target-file-size-bytes" : 134217728, # 128 MB "write.metadata.delete-after-commit.enabled" : True, "write.metadata.previous-versions-max" : 5 }) ``` `unique_id` is a the merged string of columns 1, 2, 3, 4 and 5 with an underscore. My upsertion is a simple pandas dataframe being transformed in pyarrow for upsertion. ```python iceberg_table = catalog.load_table(table_name) table = pa.Table.from_pydict(to_upsert, schema=iceberg_table.schema().as_arrow()) iceberg_table.upsert(df=table, join_cols=["unique_id"]) ``` With AWS Athena, a MERGE INTO statement takes about 3 seconds to run on the table and scans 1.11MB of data before completion. ```sql MERGE INTO <database>.<table> target USING <database>.<temp_table> source ON (target."unique_id" = source."unique_id") WHEN MATCHED THEN UPDATE SET "dt_insert" = source."dt_insert", "controller_id" = source."controller_id", "timestamp" = source."timestamp", "data_type" = source."data_type", "parameter" = source."parameter", "value" = source."value", "data_source" = source."data_source", "dt_import_utc" = source."dt_import_utc", "unique_id" = source."unique_id", "sort_id" = source."sort_id" WHEN NOT MATCHED THEN INSERT ("dt_insert", "controller_id", "timestamp", "data_type", "parameter", "value", "data_source", "dt_import_utc", "unique_id", "sort_id") VALUES (source."dt_insert", source."controller_id", source."timestamp", source."data_type", source."parameter", source."value", source."data_source", source."dt_import_utc", source."unique_id", source."sort_id") ``` (statement generated by using the `to_iceberg` from `awswrangler`) Meanwhile when i try with pyiceberg upsert, it is using more than 10240MB. I am running on AWS lambda and it is causing an out of memory error. I have no issue with the `append()` function, it completes fairly quickly but it seems that the upsert needs further optimization to be able to efficiently retrieve only relevant data. Current table size is at 18.5GB for both Athena based statement and pyiceberg upsert function call. ### Willingness to contribute - [ ] I can contribute a fix for this bug independently - [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community - [x] I cannot contribute a fix for this bug at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org