amogh-jahagirdar commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1767892495
##########
core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##########
@@ -1456,6 +1457,58 @@ public void testRewrittenDeleteFiles() {
statuses(Status.DELETED));
}
+  @TestTemplate
+  public void testRewrittenDeleteFilesReadFromManifest() throws IOException {
+    DataFile dataFile = newDataFile("data_bucket=0");
+    DeleteFile deleteFile = newDeleteFile(dataFile.specId(), "data_bucket=0");
+    RowDelta baseRowDelta = table.newRowDelta().addRows(dataFile).addDeletes(deleteFile);
+    Snapshot baseSnapshot = commit(table, baseRowDelta, branch);
+    assertThat(baseSnapshot.operation()).isEqualTo(DataOperations.OVERWRITE);
+    List<ManifestFile> deleteManifests = baseSnapshot.deleteManifests(table.io());
+    try (ManifestReader<DeleteFile> deleteReader =
+        ManifestFiles.readDeleteManifest(deleteManifests.get(0), table.io(), table.specs())) {
+      deleteFile = deleteReader.iterator().next();
+    }
+
+    assertThat(deleteFile.manifestLocation()).isEqualTo(deleteManifests.get(0).path());
Review Comment:
This test is mostly the same as the one above it; the only difference is that we
read the delete file from the manifest beforehand so that we hit the new code
path.
One thing I went back and forth on was the utility of this test, but I
concluded it's better to have it than not. On the one hand, there's no good way
to assert that we're exercising the new optimized path, since that logic lives
in internal classes (please let me know if I'm missing a way). On the other
hand, since it is a new path, I felt it's better to have a test that does end
up hitting that path and verifies everything is OK.
This, combined with the fact that the Spark rewrite data files action tests
exercise the optimized path, should give reasonable confidence.
##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -73,11 +74,14 @@ public class ReplaceDeleteFilesBenchmark {
private List<DeleteFile> deleteFiles;
private List<DeleteFile> pendingDeleteFiles;
- @Param({"50000", "100000", "500000", "1000000", "2500000"})
+ @Param({"50000", "100000", "500000", "1000000", "2000000"})
Review Comment:
I modified the benchmark added in #11166 so that we now express the percentage
of data files whose position deletes will be replaced. I also reduced the
maximum file count: since we now open the manifests during test init, with
2500000 files I hit heap OOMs even after raising the Gradle heap to 16G.
2000000 seems to be OK and still a useful enough benchmarking value. Let me
know if you have any concerns @aokolnychyi .
##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -104,27 +108,39 @@ private void dropTable() {
TABLES.dropTable(TABLE_IDENT);
}
-  private void initFiles() {
+  private void initFiles() throws IOException {
     List<DeleteFile> generatedDeleteFiles = Lists.newArrayListWithExpectedSize(numFiles);
     List<DeleteFile> generatedPendingDeleteFiles = Lists.newArrayListWithExpectedSize(numFiles);
     RowDelta rowDelta = table.newRowDelta();
+    int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced / 100.0));
     for (int ordinal = 0; ordinal < numFiles; ordinal++) {
       DataFile dataFile = FileGenerationUtil.generateDataFile(table, null);
       rowDelta.addRows(dataFile);
-
-      DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      rowDelta.addDeletes(deleteFile);
-      generatedDeleteFiles.add(deleteFile);
-
-      DeleteFile pendingDeleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      generatedPendingDeleteFiles.add(pendingDeleteFile);
+      if (filesToDelete > 0) {
+        DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        rowDelta.addDeletes(deleteFile);
+        generatedDeleteFiles.add(deleteFile);
+        DeleteFile pendingDeleteFile =
+            FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        generatedPendingDeleteFiles.add(pendingDeleteFile);
+        filesToDelete--;
Review Comment:
Opting to keep the benchmark fairly simple for now. We just use the percentage,
which correlates linearly with the number of manifests that need to be
rewritten. We could test cases that vary the distribution of deletes across
partitions, but I don't think that's strictly required now, since the current
benchmark already indicates this is a good improvement. The sketch below shows
how the percentage maps to a file count.
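To make the linear relationship concrete, here's a tiny self-contained example
of the `filesToDelete` math from `initFiles()`; the 25% value is just an
illustration:
```java
public class ReplacementCountExample {
  public static void main(String[] args) {
    int numFiles = 2_000_000;
    int percentDeleteFilesReplaced = 25; // illustrative value

    // Same ceil-based computation as initFiles(): 25% of 2,000,000 data
    // files means the first 500,000 get position deletes to replace.
    int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced / 100.0));
    System.out.println(filesToDelete); // prints 500000
  }
}
```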