qzyu999 commented on code in PR #3124:
URL: https://github.com/apache/iceberg-python/pull/3124#discussion_r2898648666


##########
pyiceberg/table/maintenance.py:
##########
@@ -43,3 +43,26 @@ def expire_snapshots(self) -> ExpireSnapshots:
         from pyiceberg.table.update.snapshot import ExpireSnapshots
 
         return ExpireSnapshots(transaction=Transaction(self.tbl, autocommit=True))
+
+    def compact(self) -> None:
+        """Compact the table's data files by reading and overwriting the entire table.
+
+        Note: This is a full-table compaction that leverages Arrow for binpacking.
+        It currently reads the entire table into memory via `.to_arrow()`.
+
+        This reads all existing data into memory and writes it back out using the
+        target file size settings (write.target-file-size-bytes), atomically
+        dropping the old files and replacing them with fewer, larger files.
+        """
+        # Read the current table state into memory
+        arrow_table = self.tbl.scan().to_arrow()
+
+        # Guard: if the table is completely empty, there's nothing to compact.
+        # Doing an overwrite with an empty table would result in deleting everything.
+        if arrow_table.num_rows == 0:
+            logger.info("Table contains no rows, skipping compaction.")
+            return
+
+        # Overwrite the table atomically (REPLACE operation)
+        with self.tbl.transaction() as txn:
+            txn.overwrite(arrow_table, snapshot_properties={"snapshot-type": "replace", "replace-operation": "compaction"})

Review Comment:
   Hi @kevinjqliu, thanks for the insight. I agree that compaction should build 
on a dedicated `replace` rather than just reuse `overwrite`. I've refactored 
the compaction run to use a proper `.replace()` API, following the design of 
the Java Iceberg implementation.
   
   The approach is to create a new `_RewriteFiles` in 
`pyiceberg/table/update/snapshot.py`, which uses the new `Operation.REPLACE` 
from `pyiceberg/table/snapshots.py`. `_RewriteFiles` works through `replace()`, 
which effectively mirrors the `_OverwriteFiles` operation except that it uses 
`Operation.REPLACE` instead of `Operation.OVERWRITE`. This allows 
`MaintenanceTable.compact()` to do a proper `txn.replace()` rather than reuse 
`txn.overwrite()`.
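   
   To make the shape of that change concrete, here is a minimal, self-contained sketch. It does not import PyIceberg and is not the code in this PR: `Operation` is a two-value stand-in, and `FileSwapSketch`, `OverwriteFilesSketch`, and `RewriteFilesSketch` are hypothetical names used only for illustration. The point is that the two producers share all of their file bookkeeping and differ only in the operation recorded in the snapshot summary.

```python
from enum import Enum


class Operation(Enum):
    """Stand-in for the snapshot operation enum; only two values are modeled."""

    OVERWRITE = "overwrite"
    REPLACE = "replace"


class FileSwapSketch:
    """Hypothetical base: records deleted and added data files for one snapshot."""

    operation: Operation  # set by each subclass

    def __init__(self) -> None:
        self.added = []
        self.deleted = []

    def append_data_file(self, path):
        # Register a newly written file; return self so calls can be chained.
        self.added.append(path)
        return self

    def delete_data_file(self, path):
        # Register a file that the new snapshot no longer references.
        self.deleted.append(path)
        return self

    def summary(self):
        # Simplified snapshot summary; a real one carries many more properties.
        return {
            "operation": self.operation.value,
            "added-data-files": str(len(self.added)),
            "deleted-data-files": str(len(self.deleted)),
        }


class OverwriteFilesSketch(FileSwapSketch):
    operation = Operation.OVERWRITE


class RewriteFilesSketch(FileSwapSketch):
    # Same bookkeeping as the overwrite producer, different recorded operation.
    operation = Operation.REPLACE


# Compaction swaps two small files for one larger file holding the same rows.
compaction = (
    RewriteFilesSketch()
    .delete_data_file("small-1.parquet")
    .delete_data_file("small-2.parquet")
    .append_data_file("compacted-1.parquet")
)
print(compaction.summary()["operation"])  # prints "replace"
```

   Readers of the snapshot log can then tell a compaction, which leaves the logical contents unchanged, apart from an overwrite that changes the data.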
   
   It's also worth noting that adding `Operation.REPLACE` makes room for the 
rewrite manifests (#270) and delete orphan files (#1200) functionality that is 
still needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

