aokolnychyi commented on code in PR #11131: URL: https://github.com/apache/iceberg/pull/11131#discussion_r1769699649
########## core/src/main/java/org/apache/iceberg/ManifestFilterManager.java: ########## @@ -81,6 +81,7 @@ public String partition() { // cache filtered manifests to avoid extra work when commits fail. private final Map<ManifestFile, ManifestFile> filteredManifests = Maps.newConcurrentMap(); + private final Set<String> manifestsWithDeletedFiles = Sets.newConcurrentHashSet(); Review Comment: I actually thought it would be called like `deleteFileManifests` and behave similar to `deleteFilePartitions`. We will keep a flag similar to `hasPathOnlyDeletes` to know if `deleteFileManifests` can be trusted. The main idea is by knowing that all files to be deleted had `manifestLocation` set correctly, we can rely on the collected manifest set to reliably tell if a particular manifest has to be rewritten or not. In the current version of the PR, we only skip opening manifests that we have matches for. However, we still read all other manifests that don't have matches. If we know that `deleteFileManifests` can be trusted, we can completely avoid reading manifests to determine what to rewrite. Does it make sense? ########## core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java: ########## @@ -73,11 +74,14 @@ public class ReplaceDeleteFilesBenchmark { private List<DeleteFile> deleteFiles; private List<DeleteFile> pendingDeleteFiles; - @Param({"50000", "100000", "500000", "1000000", "2500000"}) + @Param({"50000", "100000", "500000", "1000000", "2000000"}) Review Comment: I believe I used `jvmArgs = ['-Xmx32g']` in `jmh.gradle`. 2M sounds fine too. ########## core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java: ########## @@ -104,27 +108,39 @@ private void dropTable() { TABLES.dropTable(TABLE_IDENT); } - private void initFiles() { + private void initFiles() throws IOException { List<DeleteFile> generatedDeleteFiles = Lists.newArrayListWithExpectedSize(numFiles); List<DeleteFile> generatedPendingDeleteFiles = Lists.newArrayListWithExpectedSize(numFiles); RowDelta rowDelta = table.newRowDelta(); + int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced / 100.0)); for (int ordinal = 0; ordinal < numFiles; ordinal++) { DataFile dataFile = FileGenerationUtil.generateDataFile(table, null); rowDelta.addRows(dataFile); - - DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile); - rowDelta.addDeletes(deleteFile); - generatedDeleteFiles.add(deleteFile); - - DeleteFile pendingDeleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile); - generatedPendingDeleteFiles.add(pendingDeleteFile); + if (filesToDelete > 0) { + DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile); + rowDelta.addDeletes(deleteFile); + generatedDeleteFiles.add(deleteFile); + DeleteFile pendingDeleteFile = + FileGenerationUtil.generatePositionDeleteFile(table, dataFile); + generatedPendingDeleteFiles.add(pendingDeleteFile); + filesToDelete--; Review Comment: If I understand correctly, the new and the old logic replace all delete files. Your PR would benefit use cases when a small number of manifests has to be rewritten, so I'd probably add `numFiles` data and delete files but replace only `percentDeleteFilesReplaced`. Also, I would consider running the same benchmark for an unpartitioned table as it will not have the partition filters and Iceberg would be forced to scan through all of the metadata even if a single delete file is replaced. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org