Re: [PR] List data and metadata directories instead of table root [iceberg]

via GitHub Mon, 17 Feb 2025 02:32:12 -0800


Fokko commented on code in PR #12278:
URL: https://github.com/apache/iceberg/pull/12278#discussion_r1957869157



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -335,6 +347,21 @@ private Dataset<String> listedFileDS() {
     return spark().createDataset(completeMatchingFileRDD.rdd(), Encoders.STRING());
   }
 
+  private String dataLocation() {

Review Comment:
   If we go this route, I think it would be good to use the `LocationProvider` to get the data and metadata locations.
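   One likely reason this matters: the data and metadata directories do not have to sit at `<table location>/data` and `<table location>/metadata`, since table properties can redirect them. The sketch below is a standalone illustration of that fallback logic and does not call Iceberg's `LocationProvider`. The class `TableLocations` and its methods are hypothetical, and the property keys `write.data.path` and `write.metadata.path` are assumed to match Iceberg's table properties.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Iceberg: resolves the directories to list.
final class TableLocations {
  // Assumed property keys for overriding the default directories.
  static final String WRITE_DATA_PATH = "write.data.path";
  static final String WRITE_METADATA_PATH = "write.metadata.path";

  private TableLocations() {}

  static String dataLocation(String tableLocation, Map<String, String> props) {
    return resolve(tableLocation, props, WRITE_DATA_PATH, "data");
  }

  static String metadataLocation(String tableLocation, Map<String, String> props) {
    return resolve(tableLocation, props, WRITE_METADATA_PATH, "metadata");
  }

  // Use the configured path if present, else fall back to <table location>/<defaultDir>.
  private static String resolve(
      String tableLocation, Map<String, String> props, String key, String defaultDir) {
    String configured = props.get(key);
    if (configured != null && !configured.isEmpty()) {
      return stripTrailingSlash(configured);
    }
    return stripTrailingSlash(tableLocation) + "/" + defaultDir;
  }

  // Remove trailing slashes, but leave a bare scheme such as "s3://" untouched.
  static String stripTrailingSlash(String path) {
    String result = path;
    while (result.length() > 1 && result.endsWith("/") && !result.endsWith("://")) {
      result = result.substring(0, result.length() - 1);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put(WRITE_DATA_PATH, "s3://other-bucket/tbl-data/");
    System.out.println(dataLocation("s3://bucket/db/tbl", props));
    System.out.println(metadataLocation("s3://bucket/db/tbl", props));
  }
}
```

   With a hard-coded `location + "/data"`, a table whose data path is overridden this way would have its real data directory skipped during the orphan-file listing.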



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -301,24 +303,34 @@ private Dataset<FileURI> actualFileIdentDS() {
 
   private Dataset<String> listedFileDS() {
     List<String> subDirs = Lists.newArrayList();
-    List<String> matchingFiles = Lists.newArrayList();
+    Set<String> matchingFiles = Sets.newHashSet();
 
     Predicate<FileStatus> predicate = file -> file.getModificationTime() < olderThanTimestamp;
     PathFilter pathFilter = PartitionAwareHiddenPathFilter.forSpecs(table.specs());
 
+    List<String> locationsToList = Lists.newArrayList();
+    if (location.equals(table.location())) {
+      locationsToList.add(dataLocation());
+      locationsToList.add(metadataFileLocation());
+    } else {
+      locationsToList.add(location);
+    }

Review Comment:
   So if I'm reading this correctly, this will no longer check the table location itself for orphan files; only the data and metadata directories are listed. I think that might be reasonable.
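   To make the consequence concrete, here is a minimal standalone sketch of the branch in the diff above. `ListingPlan` and its parameters are hypothetical names used only for illustration; the real action reads these values from the table and its own fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the location selection in listedFileDS().
final class ListingPlan {
  private ListingPlan() {}

  static List<String> locationsToList(
      String requestedLocation,
      String tableLocation,
      String dataLocation,
      String metadataLocation) {
    List<String> locations = new ArrayList<>();
    if (requestedLocation.equals(tableLocation)) {
      // Default case: list only the two known subdirectories, not the table root.
      locations.add(dataLocation);
      locations.add(metadataLocation);
    } else {
      // The caller asked for a specific location, so list exactly that.
      locations.add(requestedLocation);
    }
    return locations;
  }
}
```

   Under this logic, a stray file written directly under the table root (outside both directories) is never listed, so it is never reported as an orphan unless the caller passes a location that differs from the table location.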



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
