pkhetrapal opened a new issue, #7005:
URL: https://github.com/apache/iceberg/issues/7005

   ### Apache Iceberg version
   
   0.13.1
   
   ### Query engine
   
   Other
   
   ### Please describe the bug 🐞
   
   I am using the Iceberg sink connector for AWS Glue 3.
   
   The `MERGE` command produces duplicate records when backfilling data. 
I want to make sure that a record is neither updated nor inserted when the source 
(temp view) `updated_at` timestamp is less than or equal to the target (Iceberg 
table) `updated_at` timestamp.
   
   Both of the options below produce duplicate records instead of updating the 
existing one.
   
   ```
   Table:
   
   CREATE TABLE IF NOT EXISTS table1
   USING iceberg
   PARTITIONED BY (bucket(2, _ID), days(UPDATED_AT))
   TBLPROPERTIES (
       'write.metadata.delete-after-commit.enabled'='true',
       'write.metadata.previous-versions-max'='10',
       'history.expire.max-snapshot-age-ms'='432000000',
       'history.expire.min-snapshots-to-keep'='10',
       'commit.manifest.min-count-to-merge'='25',
       'format'='parquet'
   )
   AS (SELECT * from global_temp.table2 LIMIT 0)
   ```
   
   ```
   Option 1.
   
   merge_update_sql = f"""
       MERGE INTO table1 t 
       USING (SELECT * FROM global_temp.table2) s 
       ON t._ID = s._ID
       WHEN MATCHED AND s.UPDATED_AT > t.UPDATED_AT THEN UPDATE SET *
   """
   spark.sql(merge_update_sql)
   
   merge_insert_sql = f"""
       MERGE INTO table1 t 
       USING (SELECT * FROM global_temp.table2) s 
       ON t._ID = s._ID
       WHEN NOT MATCHED THEN INSERT *
   """
   spark.sql(merge_insert_sql)
   ```
   
   ```
   Option 2.
   
   merge_sql = f"""
       MERGE INTO table1 t 
       USING (SELECT * FROM global_temp.table2) s 
       ON t._ID = s._ID
       WHEN MATCHED AND s.UPDATED_AT > t.UPDATED_AT THEN UPDATE SET *
       WHEN MATCHED AND s.UPDATED_AT <= t.UPDATED_AT THEN UPDATE SET t._ID = s._ID
       WHEN NOT MATCHED THEN INSERT *
   """
   spark.sql(merge_sql)
   ```
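
For reference, the behaviour I expect from the merge can be sketched in plain Python, without Spark or Iceberg. This is only a model of the intended semantics, not of how `MERGE INTO` is implemented: the `merge_latest` helper and the `VALUE` column are hypothetical names used for illustration, and the table is modelled as a dict keyed by `_ID`.

```python
from datetime import datetime


def merge_latest(target, source_rows):
    """Reference model of the intended upsert, keyed by _ID (hypothetical helper)."""
    merged = dict(target)  # shallow copy so the caller's table is left untouched
    for row in source_rows:
        existing = merged.get(row["_ID"])
        if existing is None:
            # Corresponds to: WHEN NOT MATCHED THEN INSERT *
            merged[row["_ID"]] = row
        elif row["UPDATED_AT"] > existing["UPDATED_AT"]:
            # Corresponds to: WHEN MATCHED AND s.UPDATED_AT > t.UPDATED_AT THEN UPDATE SET *
            merged[row["_ID"]] = row
        # Otherwise the source row is older or equal: keep the target row as-is.
    return merged


# Target already holds a newer version of _ID 1; the backfill carries a stale
# copy of _ID 1 plus a genuinely new _ID 2.
target = {1: {"_ID": 1, "UPDATED_AT": datetime(2023, 3, 2), "VALUE": "current"}}
backfill = [
    {"_ID": 1, "UPDATED_AT": datetime(2023, 3, 1), "VALUE": "stale"},
    {"_ID": 2, "UPDATED_AT": datetime(2023, 3, 1), "VALUE": "new"},
]
result = merge_latest(target, backfill)
```

In this model the stale row for `_ID` 1 is ignored and `_ID` 2 is inserted once, so the result holds exactly one row per `_ID`. That is the outcome I expect from the statements above, instead of the duplicates I am seeing.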


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

