amogh-jahagirdar commented on code in PR #9860:
URL: https://github.com/apache/iceberg/pull/9860#discussion_r1511975142
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -878,7 +886,7 @@ public Object updateEvent() {
 
   @SuppressWarnings("checkstyle:CyclomaticComplexity")
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewDataManifests != null) {
+    if (!cachedNewDataManifests.isEmpty()) {

Review Comment:
Any reason we need to change this? I see you covered all the cases in this class where we do explicit null checks and replaced them with empty checks, so I think logically we're good. I'm mostly just curious whether it's required for this change (also, I think nulling the field out would be quicker than clearing it, since `clear` walks through every element and nulls it out, but that's probably minor).

##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -220,9 +233,15 @@ protected boolean addsDeleteFiles() {
 
   /** Add a data file to the new snapshot. */
   protected void add(DataFile file) {
     Preconditions.checkNotNull(file, "Invalid data file: null");
-    setDataSpec(file);
-    addedFilesSummary.addedFile(dataSpec(), file);
+
+    PartitionSpec dataSpec = ops.current().spec(file.specId());
+    Preconditions.checkNotNull(
+        dataSpec, "Cannot find partition spec for data file: %s", file.path());

Review Comment:
Could we include the partition spec ID in the failure message? Also, could we make this `Preconditions.checkArgument(dataSpec != null)`? I know it was already a notNull check earlier, but that leads to unclear NPEs, whereas this case really is an illegal-argument exception (a data file with an invalid spec).
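To make the suggestion concrete, a rough sketch of the combined change (untested; `ops`, `file`, and `dataSpec` are reused from the diff above):

```java
// Resolve the spec the file was written with; a file that references an
// unknown spec ID is a bad argument, so fail with IllegalArgumentException
// (including the spec ID) instead of an unclear NPE.
PartitionSpec dataSpec = ops.current().spec(file.specId());
Preconditions.checkArgument(
    dataSpec != null,
    "Cannot find partition spec %s for data file: %s",
    file.specId(),
    file.path());
```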
##########
core/src/test/java/org/apache/iceberg/TestMergeAppend.java:
##########
@@ -92,6 +94,74 @@ public void testEmptyTableAppend() {
         statuses(Status.ADDED, Status.ADDED));
   }
 
+  @Test
+  public void testEmptyTableAppendFilesWithDifferentSpecs() {
+    assertThat(listManifestFiles()).as("Table should start empty").isEmpty();
+
+    TableMetadata base = readMetadata();
+    assertThat(base.currentSnapshot()).as("Should not have a current snapshot").isNull();
+    assertThat(base.lastSequenceNumber()).as("Last sequence number should be 0").isEqualTo(0);
+
+    table.updateSpec().addField("id").commit();
+    PartitionSpec newSpec = table.spec();
+
+    assertThat(table.specs().size()).as("Table should have 2 specs").isEqualTo(2);
+
+    DataFile fileOriginalSpec =
+        DataFiles.builder(SPEC)
+            .withPath("/path/to/data-a.parquet")
+            .withPartitionPath("data_bucket=0")
+            .withFileSizeInBytes(10)
+            .withRecordCount(1)
+            .build();
+
+    DataFile fileNewSpec =
+        DataFiles.builder(newSpec)
+            .withPath("/path/to/data-b.parquet")
+            .withPartitionPath("data_bucket=0/id=0")
+            .withFileSizeInBytes(10)
+            .withRecordCount(1)
+            .build();
+
+    Snapshot committedSnapshot =
+        commit(
+            table, table.newAppend().appendFile(fileOriginalSpec).appendFile(fileNewSpec), branch);
+
+    assertThat(committedSnapshot).as("Should create a snapshot").isNotNull();
+    V1Assert.assertEquals(
+        "Last sequence number should be 0", 0, table.ops().current().lastSequenceNumber());
+    V2Assert.assertEquals(
+        "Last sequence number should be 1", 1, table.ops().current().lastSequenceNumber());
+
+    assertThat(committedSnapshot.allManifests(table.io()).size())
+        .as("Should create 2 manifests for initial write, 1 manifest per spec")
+        .isEqualTo(2);
+
+    long snapshotId = committedSnapshot.snapshotId();
+
+    validateManifest(
+        committedSnapshot.allManifests(table.io()).stream()
+            .filter(m -> Objects.equals(m.partitionSpecId(), SPEC.specId()))
+            .findAny()
+            .get(),
+        dataSeqs(1L),
+        fileSeqs(1L),
+        ids(snapshotId),
+        files(fileOriginalSpec),
+        statuses(Status.ADDED));
+
+    validateManifest(
+        committedSnapshot.allManifests(table.io()).stream()
+            .filter(m -> Objects.equals(m.partitionSpecId(), newSpec.specId()))
+            .findAny()
+            .get(),
+        dataSeqs(1L),
+        fileSeqs(1L),
+        ids(snapshotId),
+        files(fileNewSpec),
+        statuses(Status.ADDED));

Review Comment:
Nit: I think for this case you could just have a single `validateManifest` call and loop over the expected specs, something like:

```java
for (Map.Entry<Integer, DataFile> expectedFilesBySpec :
    ImmutableMap.of(newSpec.specId(), fileNewSpec, SPEC.specId(), fileOriginalSpec).entrySet()) {
  validateManifest(
      committedSnapshot.allManifests(table.io()).stream()
          .filter(m -> Objects.equals(m.partitionSpecId(), expectedFilesBySpec.getKey()))
          .findAny()
          .get(),
      dataSeqs(1L),
      fileSeqs(1L),
      ids(snapshotId),
      files(expectedFilesBySpec.getValue()),
      statuses(Status.ADDED));
}
```

##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -878,7 +886,7 @@ public Object updateEvent() {
 
   @SuppressWarnings("checkstyle:CyclomaticComplexity")
   private void cleanUncommittedAppends(Set<ManifestFile> committed) {
-    if (cachedNewDataManifests != null) {
+    if (!cachedNewDataManifests.isEmpty()) {

Review Comment:
I see, we're doing the same thing for delete manifests as well. Should be fine. For context, I was scrutinizing this in particular because this path was recently added as part of the 1.4.3 patch release to fix an issue where we were skipping added files during transaction retries: https://github.com/apache/iceberg/pull/9230

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org