syun64 commented on code in PR #569:
URL: https://github.com/apache/iceberg-python/pull/569#discussion_r1632276771


##########
pyiceberg/table/__init__.py:
##########
@@ -434,6 +456,9 @@ def overwrite(
         if table_arrow_schema != df.schema:
             df = df.cast(table_arrow_schema)
 
+        with self.update_snapshot(snapshot_properties=snapshot_properties).delete() as delete_snapshot:
+            delete_snapshot.delete_by_predicate(overwrite_filter)
+
         with self.update_snapshot(snapshot_properties=snapshot_properties).overwrite() as update_snapshot:
             # skip writing data files if the dataframe is empty

Review Comment:
   @Fokko @kevinjqliu - I thought about this a bit more, and I think the order does matter.

   The reason is how the order interacts with other metadata-related features in Iceberg.

   The partitions metadata table is a good example. It associates each partition with a `snapshot_id`, taken from the snapshot in which the data file was added (appended). If a user time travels to that snapshot and the order is `delete + append`, they see the desired final state of the table. If the order is `append + delete`, they see the Iceberg table in the intermediate state of the transaction, after the new data is appended but before the old data is deleted.

   The order is correct in the current implementation; I just wanted to point this out for our own records.
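
To make the ordering argument concrete, here is a toy sketch in plain Python. It is only an illustration under simplifying assumptions: a snapshot is modelled as the set of data files visible after one operation, and `commit_overwrite` and `state_at_append` are hypothetical helpers, not pyiceberg APIs. It compares what a reader would see when time traveling to the snapshot that added the new file under each ordering.

```python
# Toy model only: each snapshot is the set of data files visible after one
# operation. The helper names are hypothetical and are not pyiceberg APIs.


def commit_overwrite(initial_files, deleted, added, order):
    """Apply 'delete' and 'append' in the given order.

    Returns a list of (operation, visible_files) pairs, one per snapshot.
    """
    visible = set(initial_files)
    snapshots = []
    for op in order:
        if op == "delete":
            visible = visible - set(deleted)
        elif op == "append":
            visible = visible | set(added)
        else:
            raise ValueError(f"unknown operation: {op}")
        snapshots.append((op, frozenset(visible)))
    return snapshots


def state_at_append(snapshots):
    """Files visible when time traveling to the snapshot that added data."""
    for op, visible in snapshots:
        if op == "append":
            return visible
    raise LookupError("no append snapshot found")


initial = {"old.parquet"}
delete_then_append = commit_overwrite(
    initial, {"old.parquet"}, {"new.parquet"}, ["delete", "append"]
)
append_then_delete = commit_overwrite(
    initial, {"old.parquet"}, {"new.parquet"}, ["append", "delete"]
)

# delete + append: the append snapshot already reflects the final state.
print(sorted(state_at_append(delete_then_append)))  # ['new.parquet']
# append + delete: the append snapshot still contains the file to be removed.
print(sorted(state_at_append(append_then_delete)))  # ['new.parquet', 'old.parquet']
```

In this model both orderings end with the same final file set; they differ only in what the snapshot that added the data exposes, which is the snapshot the partitions metadata table points to.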



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

