RussellSpitzer commented on issue #11541: URL: https://github.com/apache/iceberg/issues/11541#issuecomment-2481634096
> I have a few questions to help get started.
>
> 1. Is [TestRemoveOrphanFilesProcedure](https://github.com/apache/iceberg/blob/acd7cc1126b192ccb53ad8198bda37e983aa4c6c/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRemoveOrphanFilesProcedure.java#L70) a good place to add a new test for this? I found [this issue](https://github.com/apache/iceberg/issues/10539#issue-2362851350), maybe this is a good starting point for codifying the current problem?

I think you could put new tests there, but this may be a bit difficult to actually test, since the listing behavior is a side effect within the action.

> 2. > branch whenever a Hadoop class is about to be used and use FileIO
>
>    Is the idea to branch in [DeleteOrphanFilesSparkAction](https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java) or in some other, more general location(s)?

Just DeleteOrphanFiles; I don't believe any other location uses these classes. The base actions are Hadoop agnostic.

> 3. > we currently have everything we need in SupportPrefixOperations to allow a properly equipped FileIO instance to handle all of RemoveOrphanFiles without using any Hadoop Classes
>
>    I see `SupportPrefixOperations` has 5 FileIO implementations. Is the idea to lean on [ResolvingFileIO](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/io/ResolvingFileIO.java#L46-L50) for selecting the appropriate FileIO implementation?

The idea is that in places like https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L311 we would do a check like:

```java
if (table.io() instanceof SupportsPrefixOperations) {
  // listDirRecursivelyFileIO( location, predicate, .....)
} else {
  // list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have
  // less than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver
  listDirRecursively( // probably rename this function to listDirRecursivelyHadoop
      location,
      predicate,
      hadoopConf.value(),
      MAX_DRIVER_LISTING_DEPTH,
      MAX_DRIVER_LISTING_DIRECT_SUB_DIRS,
      subDirs,
      pathFilter,
      matchingFiles);
}
```

I haven't checked the rest of the code, but basically anywhere you see `org.apache.hadoop` classes, we should check whether the table's FileIO supports prefix operations and, if it does, branch and do the equivalent work with prefix operations.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
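To make the FileIO side of the branch concrete, here is a minimal, self-contained sketch of what a `listDirRecursivelyFileIO`-style helper could look like. The `FileIO` and `SupportsPrefixOperations` interfaces below are simplified stand-ins for Iceberg's real ones (the actual `listPrefix` returns `Iterable<FileInfo>`, not `List<String>`), and the helper name is the rename suggested above, not an existing method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OrphanListingSketch {
  // Stand-in for org.apache.iceberg.io.FileIO (simplified for illustration).
  interface FileIO {}

  // Stand-in for org.apache.iceberg.io.SupportsPrefixOperations; the real
  // listPrefix returns Iterable<FileInfo>, simplified here to List<String>.
  interface SupportsPrefixOperations extends FileIO {
    List<String> listPrefix(String prefix);
  }

  // Hypothetical helper corresponding to listDirRecursivelyFileIO: one
  // recursive prefix listing on the FileIO, no org.apache.hadoop classes.
  static List<String> listDirRecursivelyFileIO(
      SupportsPrefixOperations io, String location, Predicate<String> predicate) {
    List<String> matchingFiles = new ArrayList<>();
    for (String file : io.listPrefix(location)) {
      if (predicate.test(file)) {
        matchingFiles.add(file);
      }
    }
    return matchingFiles;
  }

  static List<String> listLocation(FileIO io, String location, Predicate<String> predicate) {
    if (io instanceof SupportsPrefixOperations) {
      // FileIO path: prefix listing handles the whole subtree in one call.
      return listDirRecursivelyFileIO((SupportsPrefixOperations) io, location, predicate);
    }
    // Hadoop fallback: the existing listDirRecursively logic would run here,
    // bounded by MAX_DRIVER_LISTING_DEPTH / MAX_DRIVER_LISTING_DIRECT_SUB_DIRS.
    throw new UnsupportedOperationException("Hadoop listing not sketched here");
  }

  public static void main(String[] args) {
    SupportsPrefixOperations io =
        prefix -> List.of(prefix + "/data/a.parquet", prefix + "/metadata/v1.json");
    List<String> files =
        listLocation(io, "s3://bucket/table", f -> f.endsWith(".parquet"));
    System.out.println(files); // prints [s3://bucket/table/data/a.parquet]
  }
}
```

The key design point is that the `instanceof` check is the only branch needed: a `FileIO` that implements prefix operations replaces the whole driver-side recursive directory walk with a single listing call, while everything else falls through to the current Hadoop path unchanged.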