karuppayya commented on code in PR #12278:
URL: https://github.com/apache/iceberg/pull/12278#discussion_r1958822862


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -335,6 +347,21 @@ private Dataset<String> listedFileDS() {
     return spark().createDataset(completeMatchingFileRDD.rdd(), 
Encoders.STRING());
   }
 
+  private String dataLocation() {

Review Comment:
   Currently `org.apache.iceberg.io.LocationProvider` has APIs to only operate 
on data location. Also APIs needs an input in fileName.
   Is the suggestion to enhance the LocationProviders to include more APIs for 
data and metadata location?



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -301,24 +303,34 @@ private Dataset<FileURI> actualFileIdentDS() {
 
   private Dataset<String> listedFileDS() {
     List<String> subDirs = Lists.newArrayList();
-    List<String> matchingFiles = Lists.newArrayList();
+    Set<String> matchingFiles = Sets.newHashSet();
 
     Predicate<FileStatus> predicate = file -> file.getModificationTime() < 
olderThanTimestamp;
     PathFilter pathFilter = 
PartitionAwareHiddenPathFilter.forSpecs(table.specs());
 
+    List<String> locationsToList = Lists.newArrayList();
+    if (location.equals(table.location())) {
+      locationsToList.add(dataLocation());
+      locationsToList.add(metadataFileLocation());
+    } else {
+      locationsToList.add(location);
+    }

Review Comment:
   This is the existing behavior to check the given location  instead of table 
location, when specified via 
`org.apache.iceberg.actions.DeleteOrphanFiles#location`
   On a related note, the current code doesn't check
   `org.apache.iceberg.TableProperties#WRITE_DATA_LOCATION`,
   `org.apache.iceberg.TableProperties#WRITE_METADATA_LOCATION`
   
   which this PR check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to