pvary commented on code in PR #12254: URL: https://github.com/apache/iceberg/pull/12254#discussion_r2034758618
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java: ########## @@ -335,7 +344,39 @@ private Dataset<String> listedFileDS() { return spark().createDataset(completeMatchingFileRDD.rdd(), Encoders.STRING()); } - private static void listDirRecursively( + private static void listDirRecursivelyWithFileIO( + SupportsPrefixOperations io, + String dir, + Predicate<org.apache.iceberg.io.FileInfo> predicate, + PathFilter pathFilter, + List<String> matchingFiles) { + String listPath = dir; + if (!dir.endsWith("/")) { + listPath = dir + "/"; + } + Iterable<org.apache.iceberg.io.FileInfo> files = io.listPrefix(listPath); Review Comment: We still will have performance issues if the `FileIO` implements `SupportsPrefixOperations` in a non-performant way. In this case instead of distributing the listing (as the original code does) we do it on a single node. (Note that the original code lists files/directories which are below the `MAX_DRIVER_LISTING_DEPTH`, then lists directories between `MAX_DRIVER_LISTING_DEPTH` and `MAX_DRIVER_LISTING_DIRECT_SUB_DIRS` and lists them using a distributed way.) I think the original authors of the code done this to ensure timely listing for huge tables. So I think simply changing the behavior would cause issues in their use-cases. We can do performance tests to show that the listings for the current `FileIO` implementations are performant enough to get rid of this optimization. Whatever we find, we need to discuss this on the dev list before moving forward. Once upon a time, long time ago, there were some discussions to make the file listing pluggable: https://github.com/apache/iceberg/issues/4346#issuecomment-1083412195 I'm not sure what the guys mentioned that think about it now. This might be another option to discuss on the dev list if you see compelling reasons to accept the complexity it brings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org