amogh-jahagirdar commented on code in PR #9860:
URL: https://github.com/apache/iceberg/pull/9860#discussion_r1511975142
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -878,7 +886,7 @@ public Object updateEvent() {
 
   @SuppressWarnings("checkstyle:CyclomaticComplexity")
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewDataManifests != null) {
+    if (!cachedNewDataManifests.isEmpty()) {

Review Comment:
Any reason we need to change this? I see you covered all the cases in this class where we do explicit null checks and replaced them with empty checks, so I think logically we're good. I'm mostly just curious whether it's required for this change (also, I think nulling the field out would be quicker than clearing it, since `clear` walks through every element and nulls it out, but that's probably minor).

##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -220,9 +233,15 @@ protected boolean addsDeleteFiles() {
 
   /** Add a data file to the new snapshot. */
   protected void add(DataFile file) {
     Preconditions.checkNotNull(file, "Invalid data file: null");
-    setDataSpec(file);
-    addedFilesSummary.addedFile(dataSpec(), file);
+
+    PartitionSpec dataSpec = ops.current().spec(file.specId());
+    Preconditions.checkNotNull(
+        dataSpec, "Cannot find partition spec for data file: %s", file.path());

Review Comment:
Could we include the partition spec ID in the failure message? Also, could we make this `Preconditions.checkArgument(dataSpec != null)`? I know it was already a notNull check earlier, but that leads to unclear NPEs, whereas this case really is an illegal-argument exception (a data file with an invalid spec).
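To make the suggestion concrete, a rough sketch of the combined change (untested; `ops`, `file`, and `dataSpec` are reused from the diff above):

```java
// Resolve the spec the file was written with; a file that references an
// unknown spec ID is a bad argument, so fail with IllegalArgumentException
// (including the spec ID) instead of an unclear NPE.
PartitionSpec dataSpec = ops.current().spec(file.specId());
Preconditions.checkArgument(
    dataSpec != null,
    "Cannot find partition spec %s for data file: %s",
    file.specId(),
    file.path());
```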
##########
core/src/test/java/org/apache/iceberg/TestMergeAppend.java:
##########
@@ -92,6 +94,74 @@ public void testEmptyTableAppend() {
         statuses(Status.ADDED, Status.ADDED));
   }
 
+  @Test
+  public void testEmptyTableAppendFilesWithDifferentSpecs() {
+    assertThat(listManifestFiles()).as("Table should start empty").isEmpty();
+
+    TableMetadata base = readMetadata();
+    assertThat(base.currentSnapshot()).as("Should not have a current snapshot").isNull();
+    assertThat(base.lastSequenceNumber()).as("Last sequence number should be 0").isEqualTo(0);
+
+    table.updateSpec().addField("id").commit();
+    PartitionSpec newSpec = table.spec();
+
+    assertThat(table.specs().size()).as("Table should have 2 specs").isEqualTo(2);
+
+    DataFile fileOriginalSpec =
+        DataFiles.builder(SPEC)
+            .withPath("/path/to/data-a.parquet")
+            .withPartitionPath("data_bucket=0")
+            .withFileSizeInBytes(10)
+            .withRecordCount(1)
+            .build();
+
+    DataFile fileNewSpec =
+        DataFiles.builder(newSpec)
+            .withPath("/path/to/data-b.parquet")
+            .withPartitionPath("data_bucket=0/id=0")
+            .withFileSizeInBytes(10)
+            .withRecordCount(1)
+            .build();
+
+    Snapshot committedSnapshot =
+        commit(
+            table, table.newAppend().appendFile(fileOriginalSpec).appendFile(fileNewSpec), branch);
+
+    assertThat(committedSnapshot).as("Should create a snapshot").isNotNull();
+    V1Assert.assertEquals(
+        "Last sequence number should be 0", 0, table.ops().current().lastSequenceNumber());
+    V2Assert.assertEquals(
+        "Last sequence number should be 1", 1, table.ops().current().lastSequenceNumber());
+
+    assertThat(committedSnapshot.allManifests(table.io()).size())
+        .as("Should create 2 manifests for initial write, 1 manifest per spec")
+        .isEqualTo(2);
+
+    long snapshotId = committedSnapshot.snapshotId();
+
+    validateManifest(
+        committedSnapshot.allManifests(table.io()).stream()
+            .filter(m -> Objects.equals(m.partitionSpecId(), SPEC.specId()))
+            .findAny()
+            .get(),
+        dataSeqs(1L),
+        fileSeqs(1L),
+        ids(snapshotId),
+        files(fileOriginalSpec),
+        statuses(Status.ADDED));
+
+    validateManifest(
+        committedSnapshot.allManifests(table.io()).stream()
+            .filter(m -> Objects.equals(m.partitionSpecId(), newSpec.specId()))
+            .findAny()
+            .get(),
+        dataSeqs(1L),
+        fileSeqs(1L),
+        ids(snapshotId),
+        files(fileNewSpec),
+        statuses(Status.ADDED));

Review Comment:
Nit: I think for this case you could just have a single `validateManifest` call and loop over the expected specs, something like:

```java
for (Map.Entry<Integer, DataFile> expectedFilesBySpec :
    ImmutableMap.of(newSpec.specId(), fileNewSpec, SPEC.specId(), fileOriginalSpec).entrySet()) {
  validateManifest(
      committedSnapshot.allManifests(table.io()).stream()
          .filter(m -> Objects.equals(m.partitionSpecId(), expectedFilesBySpec.getKey()))
          .findAny()
          .get(),
      dataSeqs(1L),
      fileSeqs(1L),
      ids(snapshotId),
      files(expectedFilesBySpec.getValue()),
      statuses(Status.ADDED));
}
```

##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -878,7 +886,7 @@ public Object updateEvent() {
 
   @SuppressWarnings("checkstyle:CyclomaticComplexity")
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewDataManifests != null) {
+    if (!cachedNewDataManifests.isEmpty()) {

Review Comment:
I see, we're doing the same thing for delete manifests as well. Should be fine. For context, I was scrutinizing this in particular because this path was recently added as part of the 1.4.3 patch release to fix an issue where we were skipping added files during transaction retries: https://github.com/apache/iceberg/pull/9230

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org