pvary commented on code in PR #12254:
URL: https://github.com/apache/iceberg/pull/12254#discussion_r2034758618
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -335,7 +344,39 @@ private Dataset<String> listedFileDS() {
return spark().createDataset(completeMatchingFileRDD.rdd(),
Encoders.STRING());
}
- private static void listDirRecursively(
+ private static void listDirRecursivelyWithFileIO(
+ SupportsPrefixOperations io,
+ String dir,
+ Predicate<org.apache.iceberg.io.FileInfo> predicate,
+ PathFilter pathFilter,
+ List<String> matchingFiles) {
+ String listPath = dir;
+ if (!dir.endsWith("/")) {
+ listPath = dir + "/";
+ }
+ Iterable<org.apache.iceberg.io.FileInfo> files = io.listPrefix(listPath);
Review Comment:
We still will have performance issues if the `FileIO` implements
`SupportsPrefixOperations` in a non-performant way. In this case instead of
distributing the listing (as the original code does) we do it on a single node.
(Note that the original code lists files/directories which are below the
`MAX_DRIVER_LISTING_DEPTH`, then lists directories between
`MAX_DRIVER_LISTING_DEPTH` and `MAX_DRIVER_LISTING_DIRECT_SUB_DIRS` and lists
them using a distributed way.)
I think the original authors of the code done this to ensure timely listing
for huge tables. So I think simply changing the behavior would cause issues in
their use-cases. We can do performance tests to show that the listings for the
current `FileIO` implementations are performant enough to get rid of this
optimization. Whatever we find, we need to discuss this on the dev list before
moving forward.
Once upon a time, long time ago, there were some discussions to make the
file listing pluggable:
https://github.com/apache/iceberg/issues/4346#issuecomment-1083412195
I'm not sure what the guys mentioned that think about it now. This might be
another option to discuss on the dev list if you see compelling reasons to
accept the complexity it brings.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]