dramaticlly commented on code in PR #12278: URL: https://github.com/apache/iceberg/pull/12278#discussion_r1958833411
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java: ########## @@ -84,11 +86,11 @@ * comparing the actual files in that location with content and metadata files referenced by all * valid snapshots. The location must be accessible for listing via the Hadoop {@link FileSystem}. * - * <p>By default, this action cleans up the table location returned by {@link Table#location()} and - * removes unreachable files that are older than 3 days using {@link Table#io()}. The behavior can - * be modified by passing a custom location to {@link #location} and a custom timestamp to {@link - * #olderThan(long)}. For example, someone might point this action to the data folder to clean up - * only orphan data files. + * <p>By default, this action cleans up data and metadata directory under the table location + * returned by {@link Table#location()} and removes unreachable files that are older than 3 days + * using {@link Table#io()}. The behavior can be modified by passing a custom location to {@link + * #location} and a custom timestamp to {@link #olderThan(long)}. For example, someone might point + * this action to the data folder to clean up only orphan data files. Review Comment: my $0.02, I think this introduce behaviour change for remove orphan files, would be great to have a email on dev@ to highlight the proposal and changes. Also I think we can just mention orphan removal is honoring `write.data.path` and `write.metadata.path` but allow for action/procedure level override if `location` is provided. (The default value of `write.data.path` and `write.metadata.path` can change independently and we dont need to mention Table#location) ########## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java: ########## @@ -1067,6 +1067,31 @@ public void testRemoveOrphanFileActionWithDeleteMode() { DeleteOrphanFiles.PrefixMismatchMode.DELETE); } + @TestTemplate + public void testRemoveOrphansFileInTableLocation() { + Table table = TABLES.create(SCHEMA, SPEC, Maps.newHashMap(), tableLocation); + + List<ThreeColumnRecord> records = + Lists.newArrayList(new ThreeColumnRecord(1, "AAAAAAAAAA", "AAAA")); + Dataset<Row> df = spark.createDataFrame(records, ThreeColumnRecord.class).coalesce(1); + + df.select("c1", "c2", "c3").write().format("iceberg").mode("append").save(tableLocation); + + // add a file to root table location + df.write().mode("append").parquet(tableLocation); + + long timestamp = System.currentTimeMillis(); + + waitUntilAfter(System.currentTimeMillis() + 1000L); + + SparkActions actions = SparkActions.get(); + + DeleteOrphanFiles.Result result = + actions.deleteOrphanFiles(table).olderThan(timestamp).execute(); + + assertThat(Iterables.size(result.orphanFileLocations())).isZero(); Review Comment: nit: this can be replaced with ```java assertThat(result.orphanFileLocations()).isEmpty(); ``` ########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java: ########## @@ -335,6 +347,21 @@ private Dataset<String> listedFileDS() { return spark().createDataset(completeMatchingFileRDD.rdd(), Encoders.STRING()); } + private String dataLocation() { Review Comment: `LocationUtil.stripTrailingSlash(table.locationProvider().newDataLocation(""))` can do, but it's not the prettiest and also we need one for `write.metadata.location`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org