RussellSpitzer commented on issue #11541: URL: https://github.com/apache/iceberg/issues/11541#issuecomment-2481634096
> I have a few questions to help get started.
>
> 1. Is [TestRemoveOrphanFilesProcedure](https://github.com/apache/iceberg/blob/acd7cc1126b192ccb53ad8198bda37e983aa4c6c/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRemoveOrphanFilesProcedure.java#L70) a good place to add a new test for this? I found [this issue](https://github.com/apache/iceberg/issues/10539#issue-2362851350), maybe this is a good starting point for codifying the current problem?

I think you could put new tests there, but this may be a bit difficult to actually test, since the listing behavior is a side effect within the action.

> 2. > branch whenever a Hadoop class is about to be used and use FileIO
>
>    Is the idea to branch in [DeleteOrphanFilesSparkAction](https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java) or in some other, more general location(s)?

Just DeleteOrphanFiles; I don't believe any other location uses these classes. The base actions are Hadoop agnostic.

> 3. > we currently have everything we need in SupportPrefixOperations to allow a properly equipped FileIO instance to handle all of RemoveOrphanFiles without using any Hadoop Classes
>
>    I see `SupportPrefixOperations` has 5 FileIO implementations. Is the idea to lean on [ResolvingFileIO](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/io/ResolvingFileIO.java#L46-L50) for selecting the appropriate FileIO implementation?

The idea is that in places like https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L311 we would do a check like:

```java
if (table.io() instanceof SupportsPrefixOperations) {
  // listDirRecursivelyFileIO( location, predicate, .....)
} else {
  // list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have
  // less than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver
  listDirRecursively( // probably rename this function to listDirRecursivelyHadoop
      location,
      predicate,
      hadoopConf.value(),
      MAX_DRIVER_LISTING_DEPTH,
      MAX_DRIVER_LISTING_DIRECT_SUB_DIRS,
      subDirs,
      pathFilter,
      matchingFiles);
}
```

I haven't checked the rest of the code, but basically anywhere you see `org.apache.hadoop` classes, we should check whether the table's FileIO supports prefix operations and, if it does, branch and do the equivalent work with prefix operations.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
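To make the FileIO side of the branch concrete, here is a minimal, self-contained sketch of what a `listDirRecursivelyFileIO`-style helper could look like. The `FileIO` and `SupportsPrefixOperations` interfaces below are simplified stand-ins for Iceberg's real ones (the actual `listPrefix` returns `Iterable<FileInfo>`, not `List<String>`), and the helper name is the rename suggested above, not an existing method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OrphanListingSketch {
  // Stand-in for org.apache.iceberg.io.FileIO (simplified for illustration).
  interface FileIO {}

  // Stand-in for org.apache.iceberg.io.SupportsPrefixOperations; the real
  // listPrefix returns Iterable<FileInfo>, simplified here to List<String>.
  interface SupportsPrefixOperations extends FileIO {
    List<String> listPrefix(String prefix);
  }

  // Hypothetical helper corresponding to listDirRecursivelyFileIO: one
  // recursive prefix listing on the FileIO, no org.apache.hadoop classes.
  static List<String> listDirRecursivelyFileIO(
      SupportsPrefixOperations io, String location, Predicate<String> predicate) {
    List<String> matchingFiles = new ArrayList<>();
    for (String file : io.listPrefix(location)) {
      if (predicate.test(file)) {
        matchingFiles.add(file);
      }
    }
    return matchingFiles;
  }

  static List<String> listLocation(FileIO io, String location, Predicate<String> predicate) {
    if (io instanceof SupportsPrefixOperations) {
      // FileIO path: prefix listing handles the whole subtree in one call.
      return listDirRecursivelyFileIO((SupportsPrefixOperations) io, location, predicate);
    }
    // Hadoop fallback: the existing listDirRecursively logic would run here,
    // bounded by MAX_DRIVER_LISTING_DEPTH / MAX_DRIVER_LISTING_DIRECT_SUB_DIRS.
    throw new UnsupportedOperationException("Hadoop listing not sketched here");
  }

  public static void main(String[] args) {
    SupportsPrefixOperations io =
        prefix -> List.of(prefix + "/data/a.parquet", prefix + "/metadata/v1.json");
    List<String> files =
        listLocation(io, "s3://bucket/table", f -> f.endsWith(".parquet"));
    System.out.println(files); // prints [s3://bucket/table/data/a.parquet]
  }
}
```

The key design point is that the `instanceof` check is the only branch needed: a `FileIO` that implements prefix operations replaces the whole driver-side recursive directory walk with a single listing call, while everything else falls through to the current Hadoop path unchanged.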