danielcweeks commented on code in PR #10233: URL: https://github.com/apache/iceberg/pull/10233#discussion_r1581856096
##########
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##########
@@ -166,23 +178,106 @@ public void deletePrefix(String prefix) {
   @Override
   public void deleteFiles(Iterable<String> pathsToDelete) throws BulkDeletionFailureException {
     AtomicInteger failureCount = new AtomicInteger(0);
-    Tasks.foreach(pathsToDelete)
-        .executeWith(executorService())
-        .retry(DELETE_RETRY_ATTEMPTS)
-        .stopRetryOn(FileNotFoundException.class)
-        .suppressFailureWhenFinished()
-        .onFailure(
-            (f, e) -> {
-              LOG.error("Failure during bulk delete on file: {} ", f, e);
-              failureCount.incrementAndGet();
-            })
-        .run(this::deleteFile);
-
+    if (WrappedIO.isBulkDeleteAvailable()) {
+      failureCount.set(bulkDeleteFiles(pathsToDelete));
+    } else {
+      Tasks.foreach(pathsToDelete)
+          .executeWith(executorService())
+          .retry(DELETE_RETRY_ATTEMPTS)
+          .stopRetryOn(FileNotFoundException.class)
+          .suppressFailureWhenFinished()
+          .onFailure(
+              (f, e) -> {
+                LOG.error("Failure during bulk delete on file: {} ", f, e);
+                failureCount.incrementAndGet();
+              })
+          .run(this::deleteFile);
+    }
     if (failureCount.get() != 0) {
       throw new BulkDeletionFailureException(failureCount.get());
     }
   }
 
+  /**
+   * Bulk delete files.
+   * This has to support a list spanning multiple filesystems, so we group the paths by filesystem
+   * of schema + host.
+   * @param pathnames paths to delete.
+   * @return count of failures.
+   */
+  private int bulkDeleteFiles(Iterable<String> pathnames) {
+
+    LOG.debug("Using bulk delete operation to delete files");
+
+    SetMultimap<String, Path> fsMap =
+        Multimaps.newSetMultimap(Maps.newHashMap(), Sets::newHashSet);
+    List<Future<List<Map.Entry<Path, String>>>> deletionTasks = Lists.newArrayList();
+    for (String path : pathnames) {

Review Comment:
   Ideally we would want to maintain the streaming nature of the delete (e.g. collect and delete batches incrementally as opposed to collecting them all first), which is the intent of the input being iterable.
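The incremental batching described above could look roughly like the sketch below. It is an illustration only, not code from the PR: `StreamingBulkDelete`, `deleteInBatches`, the `batchSize` parameter, and the `deleteBatch` callback (which returns a failure count for one batch) are all hypothetical names, and the per-filesystem grouping that the PR performs is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

class StreamingBulkDelete {

  /**
   * Walks the iterable once and hands paths to deleteBatch in groups of at most
   * batchSize, so only one batch is buffered at any time.
   *
   * @return the sum of the failure counts reported by deleteBatch
   */
  static int deleteInBatches(
      Iterable<String> paths, int batchSize, ToIntFunction<List<String>> deleteBatch) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive: " + batchSize);
    }
    int failures = 0;
    List<String> batch = new ArrayList<>(batchSize);
    for (String path : paths) {
      batch.add(path);
      if (batch.size() == batchSize) {
        // Flush a full batch before pulling more elements from the iterable.
        failures += deleteBatch.applyAsInt(batch);
        batch = new ArrayList<>(batchSize);
      }
    }
    if (!batch.isEmpty()) {
      // Flush the final, possibly smaller, batch.
      failures += deleteBatch.applyAsInt(batch);
    }
    return failures;
  }

  public static void main(String[] args) {
    List<String> paths =
        Arrays.asList("s3a://b/1", "s3a://b/2", "s3a://b/3", "s3a://b/4", "s3a://b/5");
    List<Integer> batchSizes = new ArrayList<>();
    // Stand-in for a real bulk delete call: record the batch size, report no failures.
    int failures =
        deleteInBatches(
            paths,
            2,
            batch -> {
              batchSizes.add(batch.size());
              return 0;
            });
    System.out.println(batchSizes + " failures=" + failures); // prints "[2, 2, 1] failures=0"
  }
}
```

With multiple filesystems in play, the same idea would presumably need one pending batch per filesystem key, each flushed when it fills and once more at the end.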
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org