syun64 commented on code in PR #569:
URL: https://github.com/apache/iceberg-python/pull/569#discussion_r1632276771
##########
pyiceberg/table/__init__.py:
##########
@@ -434,6 +456,9 @@ def overwrite(
if table_arrow_schema != df.schema:
df = df.cast(table_arrow_schema)
+ with self.update_snapshot(snapshot_properties=snapshot_properties).delete() as delete_snapshot:
+ delete_snapshot.delete_by_predicate(overwrite_filter)
+
with self.update_snapshot(snapshot_properties=snapshot_properties).overwrite() as update_snapshot:
# skip writing data files if the dataframe is empty
Review Comment:
@Fokko @kevinjqliu - I thought about this a bit more, and I think the order
does matter.
The reason is how the order interacts with other metadata-related features
within Iceberg.
The partitions metadata table is a great example. It is constructed by
fetching the `snapshot_id` associated with each partition, which is the
`snapshot_id` of the snapshot in which the data file was added (appended).
Suppose a user time travels to this ID. If the order is `delete + append`,
they will see the desired state of the table. If it is `append + delete`,
they will see the table in the intermediate state of the transaction, with
the new data appended but the old data not yet deleted.
The order is correct in the current implementation, but I just wanted to
point this out for our own records.
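
To make the ordering argument concrete, here is a minimal toy sketch in plain Python. It does not use the pyiceberg API: `ToyTable`, `added_in`, `time_travel`, and the file names are all hypothetical stand-ins, and each snapshot is modeled as nothing more than a frozen set of data files. It only illustrates which table state a reader would land on when time traveling to the snapshot that added the new file.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model (not pyiceberg): every commit produces a new snapshot of data files."""

    snapshots: list = field(default_factory=lambda: [frozenset()])
    # data file -> id of the snapshot that added it (what a partitions-style
    # lookup is assumed to report in this sketch)
    added_in: dict = field(default_factory=dict)

    def _commit(self, files):
        self.snapshots.append(frozenset(files))
        return len(self.snapshots) - 1  # snapshot id

    def append(self, *files):
        snapshot_id = self._commit(self.snapshots[-1] | set(files))
        for f in files:
            self.added_in[f] = snapshot_id
        return snapshot_id

    def delete(self, *files):
        return self._commit(self.snapshots[-1] - set(files))

    def time_travel(self, snapshot_id):
        return self.snapshots[snapshot_id]


def overwrite(table, old, new, delete_first):
    """Replace `old` with `new` as two commits, in either order."""
    if delete_first:
        table.delete(old)
        table.append(new)
    else:
        table.append(new)
        table.delete(old)
    # Snapshot id recorded for the newly added file.
    return table.added_in[new]


# Order used by the current implementation: delete, then append.
t1 = ToyTable()
t1.append("old.parquet")
state_delete_first = t1.time_travel(
    overwrite(t1, "old.parquet", "new.parquet", delete_first=True)
)

# Reversed order: append, then delete.
t2 = ToyTable()
t2.append("old.parquet")
state_append_first = t2.time_travel(
    overwrite(t2, "old.parquet", "new.parquet", delete_first=False)
)

print(sorted(state_delete_first))  # ['new.parquet']
print(sorted(state_append_first))  # ['new.parquet', 'old.parquet']
```

With `delete + append`, the snapshot that added the new file is also the final state of the overwrite. With `append + delete`, that same snapshot still contains the old file, which is the intermediate state described above.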
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]