Re: [PR] SPARK: Remove dependency on hadoop's filesystem class from remove orphan files [iceberg]

via GitHub Wed, 09 Apr 2025 01:16:16 -0700


pvary commented on code in PR #12254:
URL: https://github.com/apache/iceberg/pull/12254#discussion_r2034758618



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -335,7 +344,39 @@ private Dataset<String> listedFileDS() {
     return spark().createDataset(completeMatchingFileRDD.rdd(), 
Encoders.STRING());
   }
 
-  private static void listDirRecursively(
+  private static void listDirRecursivelyWithFileIO(
+      SupportsPrefixOperations io,
+      String dir,
+      Predicate<org.apache.iceberg.io.FileInfo> predicate,
+      PathFilter pathFilter,
+      List<String> matchingFiles) {
+    String listPath = dir;
+    if (!dir.endsWith("/")) {
+      listPath = dir + "/";
+    }
+    Iterable<org.apache.iceberg.io.FileInfo> files = io.listPrefix(listPath);

Review Comment:
   We still will have performance issues if the `FileIO` implements 
`SupportsPrefixOperations` in a non-performant way. In this case instead of 
distributing the listing (as the original code does) we do it on a single node. 
(Note that the original code lists files/directories which are below the 
`MAX_DRIVER_LISTING_DEPTH`, then lists directories between 
`MAX_DRIVER_LISTING_DEPTH` and `MAX_DRIVER_LISTING_DIRECT_SUB_DIRS` and lists 
them using a distributed way.)
   
   I think the original authors of the code done this to ensure timely listing 
for huge tables. So I think simply changing the behavior would cause issues in 
their use-cases. We can do performance tests to show that the listings for the 
current `FileIO` implementations are performant enough to get rid of this 
optimization. Whatever we find, we need to discuss this on the dev list before 
moving forward.
   
   Once upon a time, long time ago, there were some discussions to make the 
file listing pluggable: 
https://github.com/apache/iceberg/issues/4346#issuecomment-1083412195
   
   I'm not sure what the guys mentioned that think about it now. This might be 
another option to discuss on the dev list if you see compelling reasons to 
accept the complexity it brings.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] SPARK: Remove dependency on hadoop's filesystem class from remove orphan files [iceberg]

Reply via email to