qzyu999 commented on code in PR #3124:
URL: https://github.com/apache/iceberg-python/pull/3124#discussion_r2898648666
##########
pyiceberg/table/maintenance.py:
##########
@@ -43,3 +43,26 @@ def expire_snapshots(self) -> ExpireSnapshots:
         from pyiceberg.table.update.snapshot import ExpireSnapshots
         return ExpireSnapshots(transaction=Transaction(self.tbl, autocommit=True))
+
+    def compact(self) -> None:
+        """Compact the table's data files by reading and overwriting the entire table.
+
+        Note: This is a full-table compaction that leverages Arrow for binpacking.
+        It currently reads the entire table into memory via `.to_arrow()`.
+
+        This reads all existing data into memory and writes it back out using the
+        target file size settings (write.target-file-size-bytes), atomically
+        dropping the old files and replacing them with fewer, larger files.
+        """
+        # Read the current table state into memory
+        arrow_table = self.tbl.scan().to_arrow()
+
+        # Guard: if the table is completely empty, there's nothing to compact.
+        # Doing an overwrite with an empty table would result in deleting everything.
+        if arrow_table.num_rows == 0:
+            logger.info("Table contains no rows, skipping compaction.")
+            return
+
+        # Overwrite the table atomically (REPLACE operation)
+        with self.tbl.transaction() as txn:
+            txn.overwrite(arrow_table, snapshot_properties={"snapshot-type": "replace", "replace-operation": "compaction"})
Review Comment:
Hi @kevinjqliu, thanks for the insight. I agree that we should build a
dedicated `replace` rather than just reuse `overwrite`. I've refactored the
compaction run to use a proper `.replace()` API, following the design of the
Java Iceberg implementation.
The approach is to add a new `_RewriteFiles` in
`pyiceberg/table/update/snapshot.py` that uses the new `Operation.REPLACE`
from `pyiceberg/table/snapshots.py`. Together, `_RewriteFiles` and `replace()`
effectively mirror the `_OverwriteFiles` operation, except that they use
`Operation.REPLACE` instead of `Operation.OVERWRITE`. This lets
`MaintenanceTable.compact()` do a proper `txn.replace()` rather than reuse
`txn.overwrite()`.
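To make the idea concrete, here is a minimal, standalone sketch of the
relationship described above. It is not the code in this PR and does not
import pyiceberg: `ToyOverwriteFiles`, `ToyRewriteFiles`, and `toy_compact`
are hypothetical names used only for illustration, and the local `Operation`
enum is an assumed stand-in for the real one. The only point it shows is that
the rewrite producer can share the overwrite logic and differ only in the
operation recorded in the snapshot summary.

```python
from enum import Enum


class Operation(Enum):
    # Assumed stand-in for the snapshot operations; not imported from pyiceberg.
    APPEND = "append"
    REPLACE = "replace"
    OVERWRITE = "overwrite"
    DELETE = "delete"


class ToyOverwriteFiles:
    """Hypothetical producer that tracks added and deleted data files."""

    operation = Operation.OVERWRITE

    def __init__(self) -> None:
        self.added: list[str] = []
        self.deleted: list[str] = []

    def append_data_file(self, path: str) -> "ToyOverwriteFiles":
        self.added.append(path)
        return self

    def delete_data_file(self, path: str) -> "ToyOverwriteFiles":
        self.deleted.append(path)
        return self

    def summary(self) -> dict[str, str]:
        # The summary records which operation produced the snapshot.
        return {
            "operation": self.operation.value,
            "added-data-files": str(len(self.added)),
            "deleted-data-files": str(len(self.deleted)),
        }


class ToyRewriteFiles(ToyOverwriteFiles):
    """Same bookkeeping as the overwrite producer, but recorded as REPLACE."""

    operation = Operation.REPLACE


def toy_compact(old_files: list[str]) -> dict[str, str]:
    # Drop every old file and add one compacted file in a single "commit".
    producer = ToyRewriteFiles()
    for path in old_files:
        producer.delete_data_file(path)
    producer.append_data_file("compacted-0.parquet")
    return producer.summary()


if __name__ == "__main__":
    print(toy_compact(["a.parquet", "b.parquet", "c.parquet"]))
```

In this sketch the logical contents of the table are unchanged by
`toy_compact`; only the file layout differs, which is the distinction the
`replace` operation is meant to express compared to `overwrite`.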
I also think it's worth noting that adding `Operation.REPLACE` makes room for
the still-needed rewrite manifests (#270) and delete orphan files (#1200)
functionality.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]