kevinjqliu commented on PR #1246: URL: https://github.com/apache/iceberg-python/pull/1246#issuecomment-2439020328
> We have encountered a data loss issue when using pyIceberg to perform an overwrite operation. Typically, an overwrite operation involves creating both a delete snapshot and an append snapshot. However, if an exception occurs during the creation of the append snapshot, the current code still attempts to commit the delete snapshot, leading to potential data loss.

I'm a bit confused about the chain of events. Here's what I found digging through the code:

`table.overwrite` creates a transaction and calls its `overwrite` function:
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/__init__.py#L1044-L1045

The transaction's `overwrite` function calls both `self.delete` and `self.update_snapshot(snapshot_properties=snapshot_properties).fast_append()`:
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/__init__.py#L507-L516

`self.delete` ultimately creates an `UpdateSnapshot` (`_OverwriteFiles`)
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/__init__.py#L594-L600
and `self.update_snapshot(snapshot_properties=snapshot_properties).fast_append()` also creates an `UpdateSnapshot` (`_FastAppendFiles`).
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/__init__.py#L594-L600

Both `_OverwriteFiles` and `_FastAppendFiles` subclass `_SnapshotProducer`, which, combined with `UpdateTableMetadata`, updates the transaction:
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/update/__init__.py#L62-L70
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/update/snapshot.py#L241-L279

At this point, nothing has been committed yet. All updates are queued up in the transaction. `commit_transaction` is used to apply the changes in the transaction.
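To make the queue-then-commit flow concrete, here is a toy model of it. This is a rough sketch, not pyiceberg's implementation: `ToyTable`, `ToyTransaction`, and the `overwrite` helper are hypothetical names made up for illustration, and the sketch assumes `commit_transaction` is simply never reached when staging raises.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table: just a list of data file names."""
    files: list = field(default_factory=list)


class ToyTransaction:
    """Hypothetical transaction that queues updates and applies them on commit."""

    def __init__(self, table: ToyTable) -> None:
        self._table = table
        self._pending = []  # queued (kind, name) updates, not yet visible

    def delete(self, name: str) -> None:
        # Only staged; the table itself is not touched here.
        self._pending.append(("delete", name))

    def append(self, name: str) -> None:
        # Only staged; the table itself is not touched here.
        self._pending.append(("append", name))

    def commit_transaction(self) -> None:
        # Build the new state on a copy, then swap it in as one step.
        new_files = list(self._table.files)
        for kind, name in self._pending:
            if kind == "delete":
                new_files.remove(name)
            else:
                new_files.append(name)
        self._table.files = new_files
        self._pending.clear()


def overwrite(table: ToyTable, old: str, new: str, fail_before_append: bool = False) -> None:
    """Stage a delete and an append, then commit both together."""
    tx = ToyTransaction(table)
    tx.delete(old)
    if fail_before_append:
        # Assumption: a failure here means commit_transaction is never called.
        raise RuntimeError("simulated failure while producing the append")
    tx.append(new)
    tx.commit_transaction()


table = ToyTable(files=["a.parquet"])

try:
    overwrite(table, "a.parquet", "b.parquet", fail_before_append=True)
except RuntimeError:
    pass
files_after_failure = list(table.files)  # staged delete was never applied

overwrite(table, "a.parquet", "b.parquet")
files_after_success = list(table.files)  # delete and append applied together
```

In this model the staged delete only becomes visible if `commit_transaction` runs, so the outcome depends entirely on whether a failing code path can still reach the commit.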
For the above scenario, all updates are applied as one transaction. This transaction is either accepted or rejected as a whole. So there cannot be a scenario where the deletes are applied while the append is not.