karuppayya commented on code in PR #12278: URL: https://github.com/apache/iceberg/pull/12278#discussion_r1958822862
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java: ########## @@ -335,6 +347,21 @@ private Dataset<String> listedFileDS() { return spark().createDataset(completeMatchingFileRDD.rdd(), Encoders.STRING()); } + private String dataLocation() { Review Comment: Currently `org.apache.iceberg.io.LocationProvider` has APIs to only operate on data location. Also APIs needs an input in fileName. Is the suggestion to enhance the LocationProviders to include more APIs for data and metadata location? ########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java: ########## @@ -301,24 +303,34 @@ private Dataset<FileURI> actualFileIdentDS() { private Dataset<String> listedFileDS() { List<String> subDirs = Lists.newArrayList(); - List<String> matchingFiles = Lists.newArrayList(); + Set<String> matchingFiles = Sets.newHashSet(); Predicate<FileStatus> predicate = file -> file.getModificationTime() < olderThanTimestamp; PathFilter pathFilter = PartitionAwareHiddenPathFilter.forSpecs(table.specs()); + List<String> locationsToList = Lists.newArrayList(); + if (location.equals(table.location())) { + locationsToList.add(dataLocation()); + locationsToList.add(metadataFileLocation()); + } else { + locationsToList.add(location); + } Review Comment: This is the existing behavior to check the given location instead of table location, when specified via `org.apache.iceberg.actions.DeleteOrphanFiles#location` On a related note, the current code doesn't check `org.apache.iceberg.TableProperties#WRITE_DATA_LOCATION`, `org.apache.iceberg.TableProperties#WRITE_METADATA_LOCATION` which this PR check. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org