amogh-jahagirdar commented on code in PR #14287:
URL: https://github.com/apache/iceberg/pull/14287#discussion_r2436847973


##########
api/src/main/java/org/apache/iceberg/ExpireSnapshots.java:
##########
@@ -119,6 +119,17 @@ public interface ExpireSnapshots extends PendingUpdate<List<Snapshot>> {
    */
   ExpireSnapshots cleanExpiredFiles(boolean clean);
 
+  /**
+   * Skip the cleanup of orphaned data files as part of snapshot expiration
+   *
+   * @param retain true to retain orphaned data files only reachable by expired snapshots
+   * @return this for method chaining
+   */
+  default ExpireSnapshots retainOrphanedDataFiles(boolean retain) {

Review Comment:
   Thanks @dramaticlly, I mostly meant just using the table data location 
rather than a suffix, since a suffix wouldn't be sufficient, but you're right 
that even the table data location is based on convention. 
   I agree, though, that specifying a custom cleanup function isn't the ideal 
way to solve this use case.
   
   I think the best argument for this kind of option is reducing costs 
(fewer unnecessary requests to read manifests, and less compute spent doing 
so) for cases where we know we only want to clean up metadata and retain the 
data files. 
   
   The way I look at expressing this kind of logic is a bit different: rather 
than expressing which files we intend to retain, I'd express which files we 
should clean up, so something like `cleanExpiredFiles(CleanupMode mode)`, with 
`CleanupMode` options like `ALL`, `METADATA_ONLY`, and `NONE`. I think on this 
path we'd deprecate `cleanExpiredFiles(boolean)` as well, since there's no 
point in keeping both APIs.
   
   If we were to go with the current approach, we'd have to define precedence 
for when someone uses both the existing `cleanExpiredFiles` and the new 
`retainOrphanedDataFiles`, which is why it seems cleaner to me to define a 
single `cleanExpiredFiles` API with some modes. 
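   
   A rough sketch of the mode-based shape, just to make the idea concrete. 
Everything here is hypothetical: `ExpireSnapshotsSketch` is a simplified 
stand-in for `ExpireSnapshots`, `RecordingExpireSnapshots` only records the 
chosen mode, and the mapping of `true`/`false` to `ALL`/`NONE` is an assumption 
about how the deprecated boolean overload could delegate.

```java
/** Hypothetical cleanup modes; the names come from this discussion, not from a released API. */
enum CleanupMode {
  ALL,            // clean up expired metadata files and data files
  METADATA_ONLY,  // clean up expired metadata files, retain data files
  NONE            // clean up nothing
}

/** Simplified stand-in for ExpireSnapshots; only the cleanup-related methods are sketched. */
interface ExpireSnapshotsSketch {
  ExpireSnapshotsSketch cleanExpiredFiles(CleanupMode mode);

  /** Assumed delegation: the boolean overload maps onto two of the modes. */
  @Deprecated
  default ExpireSnapshotsSketch cleanExpiredFiles(boolean clean) {
    return cleanExpiredFiles(clean ? CleanupMode.ALL : CleanupMode.NONE);
  }
}

/** Toy implementation that only records the selected mode; it deletes nothing. */
class RecordingExpireSnapshots implements ExpireSnapshotsSketch {
  // Assumed default for this sketch: clean up everything unless told otherwise.
  private CleanupMode mode = CleanupMode.ALL;

  @Override
  public ExpireSnapshotsSketch cleanExpiredFiles(CleanupMode newMode) {
    this.mode = newMode;
    return this;
  }

  CleanupMode mode() {
    return mode;
  }

  boolean deletesDataFiles() {
    return mode == CleanupMode.ALL;
  }

  boolean deletesMetadataFiles() {
    return mode != CleanupMode.NONE;
  }
}
```

   With a single setter there is only one piece of state, so the last call 
wins and no precedence rule between two flags is needed.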



##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -59,16 +60,24 @@
 
 @ExtendWith(ParameterizedTestExtension.class)
 public class TestRemoveSnapshots extends TestBase {
+
   @Parameter(index = 1)
   private boolean incrementalCleanup;
 
-  @Parameters(name = "formatVersion = {0}, incrementalCleanup = {1}")
+  @Parameter(index = 2)
+  private boolean retainDataFile;
+
+  @Parameters(name = "formatVersion = {0}, incrementalCleanup = {1}, retainDataFile = {2}")
   protected static List<Object> parameters() {
     return Arrays.asList(
-        new Object[] {1, true},
-        new Object[] {2, true},
-        new Object[] {1, false},
-        new Object[] {2, false});
+        new Object[] {1, true, false},

Review Comment:
   Independent of the approach we take, I'm not sure we really need to 
parameterize the tests here. I'd rather just have separate tests, one per 
cleanup mode (or one each for data files retained and not retained). I don't 
think it's very valuable to run every existing expire-snapshots test against 
every combination, because this change really only adds one dimension to the 
API, and that dimension is independent of which snapshots were removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

