aokolnychyi commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1769699649
##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -81,6 +81,7 @@ public String partition() {
// cache filtered manifests to avoid extra work when commits fail.
private final Map<ManifestFile, ManifestFile> filteredManifests =
Maps.newConcurrentMap();
+ private final Set<String> manifestsWithDeletedFiles =
Sets.newConcurrentHashSet();
Review Comment:
I actually thought it would be called like `deleteFileManifests` and behave
similar to `deleteFilePartitions`. We will keep a flag similar to
`hasPathOnlyDeletes` to know if `deleteFileManifests` can be trusted. The main
idea is by knowing that all files to be deleted had `manifestLocation` set
correctly, we can rely on the collected manifest set to reliably tell if a
particular manifest has to be rewritten or not. In the current version of the
PR, we only skip opening manifests that we have matches for. However, we still
read all other manifests that don't have matches. If we know that
`deleteFileManifests` can be trusted, we can completely avoid reading manifests
to determine what to rewrite.
Does it make sense?
##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -73,11 +74,14 @@ public class ReplaceDeleteFilesBenchmark {
private List<DeleteFile> deleteFiles;
private List<DeleteFile> pendingDeleteFiles;
- @Param({"50000", "100000", "500000", "1000000", "2500000"})
+ @Param({"50000", "100000", "500000", "1000000", "2000000"})
Review Comment:
I believe I used `jvmArgs = ['-Xmx32g']` in `jmh.gradle`. 2M sounds fine
too.
##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -104,27 +108,39 @@ private void dropTable() {
TABLES.dropTable(TABLE_IDENT);
}
- private void initFiles() {
+ private void initFiles() throws IOException {
List<DeleteFile> generatedDeleteFiles =
Lists.newArrayListWithExpectedSize(numFiles);
List<DeleteFile> generatedPendingDeleteFiles =
Lists.newArrayListWithExpectedSize(numFiles);
RowDelta rowDelta = table.newRowDelta();
+ int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced
/ 100.0));
for (int ordinal = 0; ordinal < numFiles; ordinal++) {
DataFile dataFile = FileGenerationUtil.generateDataFile(table, null);
rowDelta.addRows(dataFile);
-
- DeleteFile deleteFile =
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
- rowDelta.addDeletes(deleteFile);
- generatedDeleteFiles.add(deleteFile);
-
- DeleteFile pendingDeleteFile =
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
- generatedPendingDeleteFiles.add(pendingDeleteFile);
+ if (filesToDelete > 0) {
+ DeleteFile deleteFile =
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+ rowDelta.addDeletes(deleteFile);
+ generatedDeleteFiles.add(deleteFile);
+ DeleteFile pendingDeleteFile =
+ FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+ generatedPendingDeleteFiles.add(pendingDeleteFile);
+ filesToDelete--;
Review Comment:
If I understand correctly, the new and the old logic replace all delete
files. Your PR would benefit use cases when a small number of manifests has to
be rewritten, so I'd probably add `numFiles` data and delete files but replace
only `percentDeleteFilesReplaced`.
Also, I would consider running the same benchmark for an unpartitioned table
as it will not have the partition filters and Iceberg would be forced to scan
through all of the metadata even if a single delete file is replaced.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]