amogh-jahagirdar commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1767892495
##########
core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##########
@@ -1456,6 +1457,58 @@ public void testRewrittenDeleteFiles() {
statuses(Status.DELETED));
}
+  @TestTemplate
+  public void testRewrittenDeleteFilesReadFromManifest() throws IOException {
+    DataFile dataFile = newDataFile("data_bucket=0");
+    DeleteFile deleteFile = newDeleteFile(dataFile.specId(), "data_bucket=0");
+    RowDelta baseRowDelta = table.newRowDelta().addRows(dataFile).addDeletes(deleteFile);
+    Snapshot baseSnapshot = commit(table, baseRowDelta, branch);
+    assertThat(baseSnapshot.operation()).isEqualTo(DataOperations.OVERWRITE);
+    List<ManifestFile> deleteManifests = baseSnapshot.deleteManifests(table.io());
+    try (ManifestReader<DeleteFile> deleteReader =
+        ManifestFiles.readDeleteManifest(deleteManifests.get(0), table.io(), table.specs())) {
+      deleteFile = deleteReader.iterator().next();
+    }
+
+    assertThat(deleteFile.manifestLocation()).isEqualTo(deleteManifests.get(0).path());
Review Comment:
This test is mostly the same as the one above it; the only difference is that we
read the delete file from the manifest beforehand so that we hit the new code
path.
One thing I went back and forth on was the utility of this test, but I
concluded it's better to have it than not. On the one hand, there's no good way
to assert that we're exercising the new optimized path, since that logic lives
in internal classes (please let me know if I'm missing a way). On the other
hand, since it is a new path, I felt it's better to have a test that does end
up hitting that path and verifies everything is OK.
This, combined with the fact that the Spark rewrite data files action tests
exercise the optimized path, should give reasonable confidence.
##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -73,11 +74,14 @@ public class ReplaceDeleteFilesBenchmark {
private List<DeleteFile> deleteFiles;
private List<DeleteFile> pendingDeleteFiles;
- @Param({"50000", "100000", "500000", "1000000", "2500000"})
+ @Param({"50000", "100000", "500000", "1000000", "2000000"})
Review Comment:
I modified the benchmark added in #11166 so that we now express the percentage
of data files whose position deletes will be replaced. I also reduced the
maximum file count: since we now open the manifests during test init, with
2500000 files I hit heap OOMs even after raising the Gradle heap to 16G.
2000000 seems to be OK and still a useful enough benchmarking value. Let me
know if you have any concerns @aokolnychyi .
##########
core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -104,27 +108,39 @@ private void dropTable() {
TABLES.dropTable(TABLE_IDENT);
}
-  private void initFiles() {
+  private void initFiles() throws IOException {
     List<DeleteFile> generatedDeleteFiles = Lists.newArrayListWithExpectedSize(numFiles);
     List<DeleteFile> generatedPendingDeleteFiles = Lists.newArrayListWithExpectedSize(numFiles);
     RowDelta rowDelta = table.newRowDelta();
+    int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced / 100.0));
     for (int ordinal = 0; ordinal < numFiles; ordinal++) {
       DataFile dataFile = FileGenerationUtil.generateDataFile(table, null);
       rowDelta.addRows(dataFile);
-
-      DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      rowDelta.addDeletes(deleteFile);
-      generatedDeleteFiles.add(deleteFile);
-
-      DeleteFile pendingDeleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      generatedPendingDeleteFiles.add(pendingDeleteFile);
+      if (filesToDelete > 0) {
+        DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        rowDelta.addDeletes(deleteFile);
+        generatedDeleteFiles.add(deleteFile);
+        DeleteFile pendingDeleteFile =
+            FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        generatedPendingDeleteFiles.add(pendingDeleteFile);
+        filesToDelete--;
Review Comment:
Opting to keep the benchmark fairly simple for now. We just use the percentage,
which correlates linearly with the number of manifests that need to be
rewritten. We could test cases that vary the distribution of deletes across
partitions, but I don't think that's strictly required now, since the current
benchmark already indicates this is a good improvement. The sketch below shows
how the percentage maps to a file count.
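To make the linear relationship concrete, here's a tiny self-contained example
of the `filesToDelete` math from `initFiles()`; the 25% value is just an
illustration:
```java
public class ReplacementCountExample {
  public static void main(String[] args) {
    int numFiles = 2_000_000;
    int percentDeleteFilesReplaced = 25; // illustrative value

    // Same ceil-based computation as initFiles(): 25% of 2,000,000 data
    // files means the first 500,000 get position deletes to replace.
    int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced / 100.0));
    System.out.println(filesToDelete); // prints 500000
  }
}
```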