steveloughran commented on code in PR #10233:
URL: https://github.com/apache/iceberg/pull/10233#discussion_r1606964288
##########
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##########
@@ -166,23 +178,106 @@ public void deletePrefix(String prefix) {

   @Override
   public void deleteFiles(Iterable<String> pathsToDelete) throws BulkDeletionFailureException {
     AtomicInteger failureCount = new AtomicInteger(0);
-    Tasks.foreach(pathsToDelete)
-        .executeWith(executorService())
-        .retry(DELETE_RETRY_ATTEMPTS)
-        .stopRetryOn(FileNotFoundException.class)
-        .suppressFailureWhenFinished()
-        .onFailure(
-            (f, e) -> {
-              LOG.error("Failure during bulk delete on file: {} ", f, e);
-              failureCount.incrementAndGet();
-            })
-        .run(this::deleteFile);
-
+    if (WrappedIO.isBulkDeleteAvailable()) {
+      failureCount.set(bulkDeleteFiles(pathsToDelete));
+    } else {
+      Tasks.foreach(pathsToDelete)
+          .executeWith(executorService())
+          .retry(DELETE_RETRY_ATTEMPTS)
+          .stopRetryOn(FileNotFoundException.class)
+          .suppressFailureWhenFinished()
+          .onFailure(
+              (f, e) -> {
+                LOG.error("Failure during bulk delete on file: {} ", f, e);
+                failureCount.incrementAndGet();
+              })
+          .run(this::deleteFile);
+    }
     if (failureCount.get() != 0) {
       throw new BulkDeletionFailureException(failureCount.get());
     }
   }

+  /**
+   * Bulk delete files.
+   * This has to support a list spanning multiple filesystems, so we group the paths by
+   * filesystem, keyed on scheme + host.
+   * @param pathnames paths to delete.
+   * @return count of failures.
+   */
+  private int bulkDeleteFiles(Iterable<String> pathnames) {
+
+    LOG.debug("Using bulk delete operation to delete files");
+
+    SetMultimap<String, Path> fsMap =
+        Multimaps.newSetMultimap(Maps.newHashMap(), Sets::newHashSet);
+    List<Future<List<Map.Entry<Path, String>>>> deletionTasks = Lists.newArrayList();
+    for (String path : pathnames) {
+      Path p = new Path(path);
+      final URI uri = p.toUri();
+      String fsURI = uri.getScheme() + "://" + uri.getHost() + "/";

Review Comment:
   I'm using the root path of every filesystem instance; this relies on the Path type's internal URI handling. Some S3 bucket names aren't valid hostnames, and for those the s3a policy is "not supported". Dots in bucket names are a key example; even Amazon advises against them.
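   For illustration only (not part of the patch): a minimal, standalone sketch of the grouping step discussed above, deriving a `scheme://host/` key from each path's URI so deletes can be batched per filesystem. The class name and example paths are hypothetical, and it uses plain JDK collections instead of the Guava `SetMultimap` in the diff; it only assumes `hadoop-common` on the classpath for `org.apache.hadoop.fs.Path`.

   ```java
   import java.net.URI;
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import org.apache.hadoop.fs.Path;

   // Hypothetical demo class, not part of the PR.
   public class FsRootGroupingSketch {
     public static void main(String[] args) {
       // Example paths (made up) spanning two filesystems.
       List<String> pathnames = List.of(
           "s3a://bucket-one/data/part-00000.parquet",
           "s3a://bucket-one/data/part-00001.parquet",
           "s3a://bucket-two/data/part-00002.parquet");

       // Group each path under the root URI of its filesystem (scheme + host),
       // the same key shape used by bulkDeleteFiles() in the diff above.
       Map<String, List<Path>> byFilesystem = new HashMap<>();
       for (String pathname : pathnames) {
         Path p = new Path(pathname);
         URI uri = p.toUri();
         String fsRoot = uri.getScheme() + "://" + uri.getHost() + "/";
         byFilesystem.computeIfAbsent(fsRoot, k -> new ArrayList<>()).add(p);
       }

       // Each key now identifies one filesystem whose paths could be handed to
       // a single bulk delete call.
       byFilesystem.forEach((fsRoot, files) -> System.out.println(fsRoot + " -> " + files));
     }
   }
   ```

   As the comment notes, this key derivation leans on `Path`/`URI` host parsing, so bucket names that aren't valid hostnames fall outside what the grouping (and s3a) supports.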