aokolnychyi commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1769699649


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -81,6 +81,7 @@ public String partition() {
 
   // cache filtered manifests to avoid extra work when commits fail.
   private final Map<ManifestFile, ManifestFile> filteredManifests = 
Maps.newConcurrentMap();
+  private final Set<String> manifestsWithDeletedFiles = 
Sets.newConcurrentHashSet();

Review Comment:
   I actually thought it would be called like `deleteFileManifests` and behave 
similar to `deleteFilePartitions`. We will keep a flag similar to 
`hasPathOnlyDeletes` to know if `deleteFileManifests` can be trusted. The main 
idea is by knowing that all files to be deleted had `manifestLocation` set 
correctly, we can rely on the collected manifest set to reliably tell if a 
particular manifest has to be rewritten or not. In the current version of the 
PR, we only skip opening manifests that we have matches for. However, we still 
read all other manifests that don't have matches. If we know that  
`deleteFileManifests` can be trusted, we can completely avoid reading manifests 
to determine what to rewrite.
   
   Does it make sense?



##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -73,11 +74,14 @@ public class ReplaceDeleteFilesBenchmark {
   private List<DeleteFile> deleteFiles;
   private List<DeleteFile> pendingDeleteFiles;
 
-  @Param({"50000", "100000", "500000", "1000000", "2500000"})
+  @Param({"50000", "100000", "500000", "1000000", "2000000"})

Review Comment:
   I believe I used `jvmArgs = ['-Xmx32g']` in `jmh.gradle`. 2M sounds fine 
too. 



##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -104,27 +108,39 @@ private void dropTable() {
     TABLES.dropTable(TABLE_IDENT);
   }
 
-  private void initFiles() {
+  private void initFiles() throws IOException {
     List<DeleteFile> generatedDeleteFiles = 
Lists.newArrayListWithExpectedSize(numFiles);
     List<DeleteFile> generatedPendingDeleteFiles = 
Lists.newArrayListWithExpectedSize(numFiles);
 
     RowDelta rowDelta = table.newRowDelta();
+    int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced 
/ 100.0));
 
     for (int ordinal = 0; ordinal < numFiles; ordinal++) {
       DataFile dataFile = FileGenerationUtil.generateDataFile(table, null);
       rowDelta.addRows(dataFile);
-
-      DeleteFile deleteFile = 
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      rowDelta.addDeletes(deleteFile);
-      generatedDeleteFiles.add(deleteFile);
-
-      DeleteFile pendingDeleteFile = 
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      generatedPendingDeleteFiles.add(pendingDeleteFile);
+      if (filesToDelete > 0) {
+        DeleteFile deleteFile = 
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        rowDelta.addDeletes(deleteFile);
+        generatedDeleteFiles.add(deleteFile);
+        DeleteFile pendingDeleteFile =
+            FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        generatedPendingDeleteFiles.add(pendingDeleteFile);
+        filesToDelete--;

Review Comment:
   If I understand correctly, the new and the old logic replace all delete 
files. Your PR would benefit use cases when a small number of manifests has to 
be rewritten, so I'd probably add `numFiles` data and delete files but replace 
only `percentDeleteFilesReplaced`.
   
   Also, I would consider running the same benchmark for an unpartitioned table 
as it will not have the partition filters and Iceberg would be forced to scan 
through all of the metadata even if a single delete file is replaced.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to