Re: [PR] Core: Optimize MergingSnapshotProducer to use referenced manifests to determine if manifest needs to be rewritten [iceberg]

via GitHub Sat, 28 Sep 2024 12:11:49 -0700


amogh-jahagirdar commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1779738033



##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -81,6 +81,7 @@ public String partition() {
 
   // cache filtered manifests to avoid extra work when commits fail.
   private final Map<ManifestFile, ManifestFile> filteredManifests = 
Maps.newConcurrentMap();
+  private final Set<String> manifestsWithDeletedFiles = 
Sets.newConcurrentHashSet();

Review Comment:
   Opening back up this thread in case we wanted to discuss the field name 
since I haven't changed that yet:`manifestsWithDeletedFiles` vs 
`deleteManifests`.
   
   This change is keeping track of both data and delete manifests which have 
entries which are deleted (hence the name `manifestsWithDeletedFiles`). When we 
go to prune out which manifests *cannot* contain any deleted entries, that's 
when an additional check on delete manifests is performed since cherrypick 
overwrite cases make it so that detecting data manifests cannot contain deleted 
entries is not possible. 
   
   I could also only keep track of specifically the delete manifests and then 
rename to `deleteManifests` but the advantage that gets lost is the ability to 
detect if a data manifest has an entry which is deleted, we has to go through 
the old path.



##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -104,27 +108,39 @@ private void dropTable() {
     TABLES.dropTable(TABLE_IDENT);
   }
 
-  private void initFiles() {
+  private void initFiles() throws IOException {
     List<DeleteFile> generatedDeleteFiles = 
Lists.newArrayListWithExpectedSize(numFiles);
     List<DeleteFile> generatedPendingDeleteFiles = 
Lists.newArrayListWithExpectedSize(numFiles);
 
     RowDelta rowDelta = table.newRowDelta();
+    int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced 
/ 100.0));
 
     for (int ordinal = 0; ordinal < numFiles; ordinal++) {
       DataFile dataFile = FileGenerationUtil.generateDataFile(table, null);
       rowDelta.addRows(dataFile);
-
-      DeleteFile deleteFile = 
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      rowDelta.addDeletes(deleteFile);
-      generatedDeleteFiles.add(deleteFile);
-
-      DeleteFile pendingDeleteFile = 
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      generatedPendingDeleteFiles.add(pendingDeleteFile);
+      if (filesToDelete > 0) {
+        DeleteFile deleteFile = 
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        rowDelta.addDeletes(deleteFile);
+        generatedDeleteFiles.add(deleteFile);
+        DeleteFile pendingDeleteFile =
+            FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        generatedPendingDeleteFiles.add(pendingDeleteFile);
+        filesToDelete--;

Review Comment:
   I updated the benchmark to only replace a percentage of the delete files 
rather than before where the percentage was more a ratio of how many data files 
had delete files.  That's probably a more useful bench for evaluating 
synchronous maintenance since we get a sense of how expensive the operation is 
at varying levels of replacement.
   
   After updating to have the set be a source of truth for delete manifests, a 
more noticeable difference is observed, as expected! I updated the benchmark 
results in the PR description.



##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -325,7 +341,15 @@ private ManifestFile filterManifest(Schema tableSchema, 
ManifestFile manifest) {
     }
   }
 
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
   private boolean canContainDeletedFiles(ManifestFile manifest) {
+    boolean manifestReferencedInDelete = 
manifestsWithDeletedFiles.contains(manifest.path());
+    if (manifest.content() == ManifestContent.DELETES && 
allDeletesReferenceManifests) {

Review Comment:
   Note: We could also simplify all this to only track delete manifests which 
have entries which are deleted. But I think that loses out on the ability for 
the true positive case where a data manifest has a deleted entry and we can 
quickly return true and skip the manifest read. cc @aokolnychyi let me know 
what you think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Core: Optimize MergingSnapshotProducer to use referenced manifests to determine if manifest needs to be rewritten [iceberg]

Reply via email to