kevinjqliu commented on PR #1246:
URL: https://github.com/apache/iceberg-python/pull/1246#issuecomment-2439020328

   > We have encountered a data loss issue when using pyIceberg to perform an 
overwrite operation. Typically, an overwrite operation involves creating both a 
delete snapshot and an append snapshot. However, if an exception occurs during 
the creation of the append snapshot, the current code still attempts to commit 
the delete snapshot, leading to potential data loss.
   
   I'm a bit confused about the chain of events. Here's what I found digging 
through the code:
   
   `table.overwrite` creates a transaction and calls its `overwrite` function
   
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/__init__.py#L1044-L1045
   
   In the transaction's `overwrite` function, it calls both `self.delete` and 
`self.update_snapshot(snapshot_properties=snapshot_properties).fast_append()` 
   
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/__init__.py#L507-L516
   
   `self.delete` ultimately creates an `UpdateSnapshot` (`_OverwriteFiles`)
   
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/__init__.py#L594-L600
   and 
`self.update_snapshot(snapshot_properties=snapshot_properties).fast_append()` 
also creates an `UpdateSnapshot` (`_FastAppendFiles`). 
   
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/__init__.py#L594-L600
   
   Both `_OverwriteFiles` and `_FastAppendFiles` subclass `_SnapshotProducer`, 
which, combined with `UpdateTableMetadata`, adds its updates to the transaction
   
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/update/__init__.py#L62-L70
   
https://github.com/apache/iceberg-python/blob/de976fe1719882c1fc13f02950e82b4d894276aa/pyiceberg/table/update/snapshot.py#L241-L279
   
   At this point, nothing has been committed yet. All updates are queued up in 
the transaction. 
   `commit_transaction` is used to apply the changes in the transaction.
   For the above scenario, all updates are applied as one transaction. This 
transaction is either accepted or rejected as a whole. So there cannot be a 
scenario where the deletes are applied while the append is not.
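
To make the reasoning concrete, here is a toy model of "queue everything, then 
commit once". This is only a sketch of the idea, not pyiceberg's actual code: 
`ToyTable`, `ToyTransaction`, and the `overwrite` helper are made-up names, and 
a real commit goes through the catalog rather than swapping a Python list.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyTable:
    """Hypothetical stand-in for table metadata: just a list of data file names."""

    files: List[str] = field(default_factory=list)


class ToyTransaction:
    """Hypothetical transaction: queues updates in memory, applies them in one commit."""

    def __init__(self, table: ToyTable) -> None:
        self._table = table
        self._pending: List[Tuple[str, str]] = []  # (action, file name)

    def delete(self, name: str) -> None:
        # Only queued here; the table itself is not touched yet.
        self._pending.append(("delete", name))

    def append(self, name: str) -> None:
        # Simulate a failure while producing the append update.
        if not name:
            raise ValueError("cannot append an unnamed file")
        self._pending.append(("append", name))

    def commit_transaction(self) -> None:
        # Apply every queued update to a copy, then swap it in as a whole.
        staged = list(self._table.files)
        for action, name in self._pending:
            if action == "delete":
                staged.remove(name)
            else:
                staged.append(name)
        self._table.files = staged
        self._pending.clear()


def overwrite(table: ToyTable, old: str, new: str) -> None:
    txn = ToyTransaction(table)
    txn.delete(old)
    txn.append(new)  # if this raises, commit_transaction below is never reached
    txn.commit_transaction()
```

In this model, a failing append leaves the table untouched, because the queued 
delete never reaches `commit_transaction`:

```python
table = ToyTable(files=["a.parquet"])
overwrite(table, "a.parquet", "b.parquet")  # delete + append land together

try:
    overwrite(table, "b.parquet", "")  # append raises before the commit
except ValueError:
    pass

assert table.files == ["b.parquet"]  # the queued delete was never applied
```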
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

