dramaticlly commented on code in PR #14287:
URL: https://github.com/apache/iceberg/pull/14287#discussion_r2418060119
##########
api/src/main/java/org/apache/iceberg/ExpireSnapshots.java:
##########
@@ -119,6 +119,17 @@ public interface ExpireSnapshots extends PendingUpdate<List<Snapshot>> {
*/
ExpireSnapshots cleanExpiredFiles(boolean clean);
+ /**
+ * Skip the cleanup of orphaned data files as part of snapshot expiration
+ *
+ * @param retain true to retain orphaned data files only reachable by expired snapshots
+ * @return this for method chaining
+ */
+ default ExpireSnapshots retainOrphanedDataFiles(boolean retain) {
Review Comment:
Thanks @amogh-jahagirdar! We actually explored that option, and here is what
we found:
1. Using the retainOrphanedDataFiles option speeds up the cleanup process
because it avoids opening and reading the manifest files. If only metadata files
(manifest lists and manifests) are considered for cleanup, we can skip reading
the manifests entirely. Reading them is usually the bottleneck and requires
work distribution, which is typically handled in the Spark action and procedures.
2. The `deleteWith` consumer currently receives only a file path as a String.
We could use the file suffix to differentiate metadata files from data files,
but with the introduction of #13769 we can no longer rely on `.parquet` alone to
tell them apart. We could probably still check for `$tablePath/data/` in the
file path, but that layout is a convention rather than a guarantee.
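
To make option 2 concrete, here is a rough sketch of what a path-based filter around the delete consumer could look like. It uses only `java.util`, and the class and method names (`PathBasedDeleteFilter`, `looksLikeDataFile`, `metadataOnly`) are made up for illustration; nothing here is part of the Iceberg API. It also assumes the conventional `$tablePath/data/` layout, which is exactly the assumption that may not hold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Iceberg: wraps a delete consumer so that
// paths under the conventional "<tableLocation>/data/" prefix are skipped.
class PathBasedDeleteFilter {

  // Assumption: data files live under "<tableLocation>/data/". Tables with a
  // custom data location would be misclassified by this check.
  static boolean looksLikeDataFile(String tableLocation, String path) {
    String base = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
    return path.startsWith(base + "data/");
  }

  // Returns a consumer that forwards only non-data paths to the delegate.
  static Consumer<String> metadataOnly(String tableLocation, Consumer<String> delegate) {
    return path -> {
      if (!looksLikeDataFile(tableLocation, path)) {
        delegate.accept(path);
      }
    };
  }

  public static void main(String[] args) {
    List<String> deleted = new ArrayList<>();
    Consumer<String> filter = metadataOnly("s3://bucket/db/tbl", deleted::add);

    filter.accept("s3://bucket/db/tbl/data/part-0.parquet");  // skipped
    filter.accept("s3://bucket/db/tbl/metadata/snap-1.avro"); // forwarded

    System.out.println(deleted);
  }
}
```

Note that a filter like this only decides what gets deleted at the end. The manifests would still have to be read to produce the candidate paths in the first place, so it does not give the speedup described in point 1.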
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]