amogh-jahagirdar commented on code in PR #11131: URL: https://github.com/apache/iceberg/pull/11131#discussion_r1767892495
########## core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##########
@@ -1456,6 +1457,58 @@ public void testRewrittenDeleteFiles() {
         statuses(Status.DELETED));
   }

+  @TestTemplate
+  public void testRewrittenDeleteFilesReadFromManifest() throws IOException {
+    DataFile dataFile = newDataFile("data_bucket=0");
+    DeleteFile deleteFile = newDeleteFile(dataFile.specId(), "data_bucket=0");
+    RowDelta baseRowDelta = table.newRowDelta().addRows(dataFile).addDeletes(deleteFile);
+    Snapshot baseSnapshot = commit(table, baseRowDelta, branch);
+    assertThat(baseSnapshot.operation()).isEqualTo(DataOperations.OVERWRITE);
+    List<ManifestFile> deleteManifests = baseSnapshot.deleteManifests(table.io());
+    try (ManifestReader<DeleteFile> deleteReader =
+        ManifestFiles.readDeleteManifest(deleteManifests.get(0), table.io(), table.specs())) {
+      deleteFile = deleteReader.iterator().next();
+    }
+
+    assertThat(deleteFile.manifestLocation()).isEqualTo(deleteManifests.get(0).path());

Review Comment:
   This test is mostly the same as the one above it; the only difference is that we read the delete file from the manifest beforehand so that we hit the new code path. One thing I went back and forth on was the utility of this test, but I concluded it's better to have it than not. On the one hand, there's no good way to assert that we're exercising the new optimized path, since that's all in internal classes (please let me know if I'm missing a way). On the other hand, since it is a new path, I felt it's better to have a test that does end up hitting that path and verifies everything is OK. This, combined with the fact that the Spark rewrite data files action tests exercise the optimized path, should give reasonable confidence.
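[Editor's note] The benchmark change below derives the number of files that receive position deletes from a percentage parameter using `Math.ceil`, so any non-zero percentage selects at least one file. A minimal, self-contained sketch of that rounding logic (the class and method names here are mine for illustration, not from the PR):

```java
public class FilesToDeleteSketch {
  // Mirrors the benchmark's computation: ceil(numFiles * percent / 100)
  // data files are assigned position deletes that will later be replaced.
  static int filesToDelete(int numFiles, double percentReplaced) {
    return (int) Math.ceil(numFiles * (percentReplaced / 100.0));
  }

  public static void main(String[] args) {
    System.out.println(filesToDelete(2_000_000, 5)); // 100000
    System.out.println(filesToDelete(3, 50));        // 2 (ceil rounds up)
    System.out.println(filesToDelete(2_000_000, 0)); // 0
  }
}
```

Because of the `ceil`, the selected count rounds up rather than truncating, which keeps small percentages from selecting zero files.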
########## core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -73,11 +74,14 @@ public class ReplaceDeleteFilesBenchmark {
   private List<DeleteFile> deleteFiles;
   private List<DeleteFile> pendingDeleteFiles;

-  @Param({"50000", "100000", "500000", "1000000", "2500000"})
+  @Param({"50000", "100000", "500000", "1000000", "2000000"})

Review Comment:
   I modified the benchmark added in #11166 so that we now express a percentage of data files whose position deletes will be replaced. I reduced this value because we now open the manifest in the init of the test, and with 2500000 files I get heap OOMs even after increasing the Gradle heap memory to 16G. 2000000 seems to be OK and still a useful benchmarking value. Let me know if you have any concerns @aokolnychyi.

########## core/src/jmh/java/org/apache/iceberg/ReplaceDeleteFilesBenchmark.java:
##########
@@ -104,27 +108,39 @@ private void dropTable() {
     TABLES.dropTable(TABLE_IDENT);
   }

-  private void initFiles() {
+  private void initFiles() throws IOException {
     List<DeleteFile> generatedDeleteFiles = Lists.newArrayListWithExpectedSize(numFiles);
     List<DeleteFile> generatedPendingDeleteFiles = Lists.newArrayListWithExpectedSize(numFiles);

     RowDelta rowDelta = table.newRowDelta();
+    int filesToDelete = (int) Math.ceil(numFiles * (percentDeleteFilesReplaced / 100.0));

     for (int ordinal = 0; ordinal < numFiles; ordinal++) {
       DataFile dataFile = FileGenerationUtil.generateDataFile(table, null);
       rowDelta.addRows(dataFile);
-
-      DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      rowDelta.addDeletes(deleteFile);
-      generatedDeleteFiles.add(deleteFile);
-
-      DeleteFile pendingDeleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
-      generatedPendingDeleteFiles.add(pendingDeleteFile);
+      if (filesToDelete > 0) {
+        DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        rowDelta.addDeletes(deleteFile);
+        generatedDeleteFiles.add(deleteFile);
+        DeleteFile pendingDeleteFile =
+            FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
+        generatedPendingDeleteFiles.add(pendingDeleteFile);
+        filesToDelete--;

Review Comment:
   Opting to keep the benchmark fairly simple for now: we just use the percentage, which correlates linearly with the number of manifests that need to be rewritten. We could test cases that vary the distribution of deletes across partitions, but I don't think that's strictly required now, since the current benchmark already indicates this is a good improvement.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org