Re: [PR] List data and metadata directories instead of table root [iceberg]

via GitHub Mon, 17 Feb 2025 15:01:16 -0800


dramaticlly commented on code in PR #12278:
URL: https://github.com/apache/iceberg/pull/12278#discussion_r1958833411



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -84,11 +86,11 @@
  * comparing the actual files in that location with content and metadata files 
referenced by all
  * valid snapshots. The location must be accessible for listing via the Hadoop 
{@link FileSystem}.
  *
- * <p>By default, this action cleans up the table location returned by {@link 
Table#location()} and
- * removes unreachable files that are older than 3 days using {@link 
Table#io()}. The behavior can
- * be modified by passing a custom location to {@link #location} and a custom 
timestamp to {@link
- * #olderThan(long)}. For example, someone might point this action to the data 
folder to clean up
- * only orphan data files.
+ * <p>By default, this action cleans up data and metadata directory under the 
table location
+ * returned by {@link Table#location()} and removes unreachable files that are 
older than 3 days
+ * using {@link Table#io()}. The behavior can be modified by passing a custom 
location to {@link
+ * #location} and a custom timestamp to {@link #olderThan(long)}. For example, 
someone might point
+ * this action to the data folder to clean up only orphan data files.

Review Comment:
   my $0.02, I think this introduce behaviour change for remove orphan files, 
would be great to have a email on dev@ to highlight the proposal and changes. 
   Also I think we can just mention orphan removal is honoring 
`write.data.path` and `write.metadata.path` but allow for action/procedure 
level override if `location` is provided. (The default value of 
`write.data.path` and `write.metadata.path` can change independently and we 
dont need to mention Table#location)



##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java:
##########
@@ -1067,6 +1067,31 @@ public void testRemoveOrphanFileActionWithDeleteMode() {
         DeleteOrphanFiles.PrefixMismatchMode.DELETE);
   }
 
+  @TestTemplate
+  public void testRemoveOrphansFileInTableLocation() {
+    Table table = TABLES.create(SCHEMA, SPEC, Maps.newHashMap(), 
tableLocation);
+
+    List<ThreeColumnRecord> records =
+        Lists.newArrayList(new ThreeColumnRecord(1, "AAAAAAAAAA", "AAAA"));
+    Dataset<Row> df = spark.createDataFrame(records, 
ThreeColumnRecord.class).coalesce(1);
+
+    df.select("c1", "c2", 
"c3").write().format("iceberg").mode("append").save(tableLocation);
+
+    // add a file to root table location
+    df.write().mode("append").parquet(tableLocation);
+
+    long timestamp = System.currentTimeMillis();
+
+    waitUntilAfter(System.currentTimeMillis() + 1000L);
+
+    SparkActions actions = SparkActions.get();
+
+    DeleteOrphanFiles.Result result =
+        actions.deleteOrphanFiles(table).olderThan(timestamp).execute();
+
+    assertThat(Iterables.size(result.orphanFileLocations())).isZero();

Review Comment:
   nit: this can be replaced with 
   ```java
       assertThat(result.orphanFileLocations()).isEmpty();
   ```



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -335,6 +347,21 @@ private Dataset<String> listedFileDS() {
     return spark().createDataset(completeMatchingFileRDD.rdd(), 
Encoders.STRING());
   }
 
+  private String dataLocation() {

Review Comment:
   
`LocationUtil.stripTrailingSlash(table.locationProvider().newDataLocation(""))` 
can do, but it's not the prettiest and also we need one for 
`write.metadata.location`. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] List data and metadata directories instead of table root [iceberg]

Reply via email to