amogh-jahagirdar commented on code in PR #14287: URL: https://github.com/apache/iceberg/pull/14287#discussion_r2445951970
########## api/src/main/java/org/apache/iceberg/CleanupMode.java: ########## @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg; + +/** An enum representing possible clean up mode used in snapshot expiration. */ +public enum CleanupMode { Review Comment: Since this represents the different modes specifically for expiration, I think that this should be nested inside ExpireSnapshots itself. ########## core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java: ########## @@ -134,22 +135,24 @@ public void testExpireOlderThanWithDelete() { assertThat(table.snapshot(firstSnapshot.snapshotId())).isNull(); assertThat(table.snapshot(secondSnapshot.snapshotId())).isNull(); + Set<String> expectedDeletedFiles = Review Comment: Could we revert some of these unrelated test changes/refactorings, just to make the diff smaller and easier to review? Not opposed to the refactoring itself, but feel like that should be separate. ########## core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java: ########## @@ -1951,6 +1960,155 @@ private RemoveSnapshots removeSnapshots(Table table) { return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup); } + @TestTemplate + public void testCleanupModeAll() { + table.newAppend().appendFile(FILE_A).commit(); + Snapshot firstSnapshot = table.currentSnapshot(); + table.newDelete().deleteFile(FILE_A).commit(); + Snapshot secondSnapshot = table.currentSnapshot(); + table.newAppend().appendFile(FILE_B).commit(); + long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis()); + + Set<String> deletedFiles = Sets.newHashSet(); + + table + .expireSnapshots() + .cleanMode(CleanupMode.ALL) + .expireOlderThan(tAfterCommits) + .deleteWith(deletedFiles::add) + .commit(); + + assertThat(deletedFiles) + .as("CleanupMode.ALL should delete both metadata and data files") + .containsExactlyInAnyOrder( + firstSnapshot.manifestListLocation(), + firstSnapshot.allManifests(table.io()).get(0).path(), + secondSnapshot.manifestListLocation(), + secondSnapshot.allManifests(table.io()).get(0).path(), + FILE_A.location()); + } + + @TestTemplate + public void testCleanupModeMetadataOnly() { + table.newAppend().appendFile(FILE_A).commit(); + Snapshot firstSnapshot = table.currentSnapshot(); + table.newDelete().deleteFile(FILE_A).commit(); + Snapshot secondSnapshot = table.currentSnapshot(); + table.newAppend().appendFile(FILE_B).commit(); + long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis()); + + Set<String> deletedFiles = Sets.newHashSet(); + + table + .expireSnapshots() + .cleanMode(CleanupMode.METADATA_ONLY) + .expireOlderThan(tAfterCommits) + .deleteWith(deletedFiles::add) + .commit(); + + assertThat(deletedFiles) + .as("CleanupMode.METADATA_ONLY should delete only metadata files") + .containsExactlyInAnyOrder( + firstSnapshot.manifestListLocation(), + firstSnapshot.allManifests(table.io()).get(0).path(), + secondSnapshot.manifestListLocation(), + secondSnapshot.allManifests(table.io()).get(0).path()); + } + + @TestTemplate + public void testCleanupModeNone() { + table.newAppend().appendFile(FILE_A).commit(); + Snapshot firstSnapshot = table.currentSnapshot(); + table.newDelete().deleteFile(FILE_A).commit(); + Snapshot secondSnapshot = table.currentSnapshot(); + table.newAppend().appendFile(FILE_B).commit(); + long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis()); + + Set<String> deletedFiles = Sets.newHashSet(); + + table + .expireSnapshots() + .cleanMode(CleanupMode.NONE) + .expireOlderThan(tAfterCommits) + .deleteWith(deletedFiles::add) + .commit(); + + assertThat(table.snapshot(firstSnapshot.snapshotId())) + .as("First snapshot metadata should be removed even with CleanupMode.NONE") + .isNull(); + assertThat(table.snapshot(secondSnapshot.snapshotId())) + .as("Second snapshot metadata should be removed even with CleanupMode.NONE") + .isNull(); + + assertThat(deletedFiles).as("CleanupMode.NONE should not delete any files").isEmpty(); + } + + @TestTemplate + public void testDeprecatedMethodsStillWork() { Review Comment: I feel like `testCleanExpiredFilesApi()` should suffice and is more specific for which behavior we're trying to validate; for someone without context, it's unclear from the test name which API is being deprecated. ########## core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java: ########## @@ -735,7 +741,8 @@ public void dataFilesCleanupWithParallelTasks() throws IOException { assertThat(deletedFiles).contains(FILE_B.location()); assertThat(planThreadsIndex.get()) .as("Thread should be created in provided pool") - .isGreaterThan(0); + // incremental with retain will not use plan executor + .isGreaterThanOrEqualTo(0); Review Comment: I'm not quite sure why this changed? By default we should not be retaining (to preserve the existing behavior) ########## core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java: ########## @@ -45,18 +45,21 @@ public void accept(String file) { protected final FileIO fileIO; protected final ExecutorService planExecutorService; + protected final CleanupMode cleanupMode; private final Consumer<String> deleteFunc; private final ExecutorService deleteExecutorService; protected FileCleanupStrategy( FileIO fileIO, ExecutorService deleteExecutorService, ExecutorService planExecutorService, - Consumer<String> deleteFunc) { + Consumer<String> deleteFunc, + CleanupMode cleanupMode) { Review Comment: Minor point: Instead of setting this on the strategy itself, I think this should be passed as an option to the cleanFiles API `cleanFiles(TableMetadata before, TableMetadata after, CleanupMode mode)`. The way I think about this is I'm indicating how to cleanup as part of the individual cleanup call, not really a state of the strategy. ########## api/src/main/java/org/apache/iceberg/ExpireSnapshots.java: ########## @@ -116,9 +116,39 @@ public interface ExpireSnapshots extends PendingUpdate<List<Snapshot>> { * * @param clean setting this to false will skip deleting expired manifests and files * @return this for method chaining + * @deprecated since 1.10.0, will be removed in 2.0.0; use {@link #cleanMode(CleanupMode)} + * instead. */ + @Deprecated ExpireSnapshots cleanExpiredFiles(boolean clean); + /** + * Configures the cleanup mode for expired files. + * + * <p>This method provides fine-grained control over which files are cleaned up during snapshot + * expiration. The cleanup modes are: + * + * <ul> + * <li>{@link CleanupMode#ALL} - Clean up both metadata and data files (default) + * <li>{@link CleanupMode#METADATA_ONLY} - Clean up only metadata files (manifests, manifest Review Comment: metadata only will also cleanup statistics files as well, so I think we should call that out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
