laskoviymishka commented on code in PR #1021:
URL: https://github.com/apache/iceberg-go/pull/1021#discussion_r3199996281


##########
manifest.go:
##########
@@ -2187,6 +2187,19 @@ func (b *DataFileBuilder) FirstRowID(id int64) 
*DataFileBuilder {
        return b
 }
 
+// SetDataFileFirstRowID sets the first_row_id on an existing DataFile.
+// This is used by the snapshot producer to assign row IDs at write time for 
v3 tables.
+// Returns false if the DataFile implementation does not support mutation.
+func SetDataFileFirstRowID(df DataFile, id int64) bool {

Review Comment:
   I'd reconsider exposing this as a top-level mutator. The function 
type-asserts to the unexported `*dataFile`, returns `false` silently for any 
other implementation, and aliases the caller's pointer — so a `DataFile` the 
caller still holds gets its `FirstRowIDField` mutated as a side effect. That's 
a surprising contract for a public API, and the `bool` return treats what is 
really an invariant (`*dataFile` is the only impl in this package) as a soft 
optional.
   
   Could we either fold this into `DataFileBuilder` (build a fresh `*dataFile` 
with `FirstRowID` set, no in-place mutation) or keep the helper unexported 
within the package? wdyt?



##########
table/snapshot_producers.go:
##########
@@ -733,7 +733,20 @@ func (sp *snapshotProducer) manifestProducer(content 
iceberg.ManifestContent, fi
                }
                defer internal.CheckedClose(wr, &err)
 
+               // For v3 data manifests, assign first_row_id to each data file.
+               // Each file claims a contiguous range of row IDs starting from 
NextRowID.
+               assignRowIDs := sp.txn.meta.formatVersion >= 3 && content == 
iceberg.ManifestContentData
+               nextRowID := sp.txn.meta.NextRowID()

Review Comment:
   Even if we keep the per-file assignment, this read happens once at 
producer-build time and the values get baked into the data-manifest avro. On 
OCC retry, `rebuildSnapshotUpdates` (`table/table.go`) only rewrites the 
manifest *list*, not the data manifest entries — so under contention the 
per-file IDs stay at the stale `NextRowID` from the first attempt while 
metadata's `NextRowID` has moved.
   
   `validateAndUpdateRowLineage` (`table/metadata.go:474-478`) would then 
reject the retry deterministically. The recently-landed manifest-list rebuild 
was designed assuming data manifests can be reused as-is on retry; 
pre-assigning IDs breaks that invariant.
   
   Worth a v3 OCC retry test either way — the existing `occ_scenario_test.go` 
patterns should fit. Could we either move the assignment to commit time (after 
retries settle) or scrap pre-assignment in favor of inheritance?



##########
table/snapshot_producers.go:
##########
@@ -733,7 +733,20 @@ func (sp *snapshotProducer) manifestProducer(content 
iceberg.ManifestContent, fi
                }
                defer internal.CheckedClose(wr, &err)
 
+               // For v3 data manifests, assign first_row_id to each data file.
+               // Each file claims a contiguous range of row IDs starting from 
NextRowID.
+               assignRowIDs := sp.txn.meta.formatVersion >= 3 && content == 
iceberg.ManifestContentData

Review Comment:
   I think the bigger question is whether we should be assigning per-file 
`first_row_id` at all on ADDED entries. The v3 spec ("First Row ID 
Inheritance") says ADDED data files carry null `first_row_id`; the 
manifest-list writer assigns the manifest-level value, and readers inherit 
per-file IDs by summing preceding `record_count`s. Java's 
`MergingSnapshotProducer.appendManifest` explicitly rejects inputs where 
`manifest.firstRowId() != null`.
   
   iceberg-go already implements both halves of the inheritance model — list 
writer at `manifest.go:1419-1425`, reader at `manifest.go:734-742`. After this 
PR, manifests we write would carry non-null per-file IDs that Java's 
`appendManifest` would refuse to replay, and the per-file values become a 
second source of truth alongside the manifest-level one.
   
   Is there a reason we're diverging from inheritance here? Or could we drop 
the per-file assignment and rely on what's already wired up?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to