laskoviymishka commented on code in PR #1021:
URL: https://github.com/apache/iceberg-go/pull/1021#discussion_r3199996281
##########
manifest.go:
##########
@@ -2187,6 +2187,19 @@ func (b *DataFileBuilder) FirstRowID(id int64)
*DataFileBuilder {
return b
}
+// SetDataFileFirstRowID sets the first_row_id on an existing DataFile.
+// This is used by the snapshot producer to assign row IDs at write time for
v3 tables.
+// Returns false if the DataFile implementation does not support mutation.
+func SetDataFileFirstRowID(df DataFile, id int64) bool {
Review Comment:
I'd reconsider exposing this as a top-level mutator. The function
type-asserts to the unexported `*dataFile`, returns `false` silently for any
other implementation, and aliases the caller's pointer — so a `DataFile` the
caller still holds gets its `FirstRowIDField` mutated as a side effect. That's
a surprising contract for a public API, and the `bool` return treats what is
really an invariant (`*dataFile` is the only impl in this package) as a soft
optional.
Could we either fold this into `DataFileBuilder` (build a fresh `*dataFile`
with `FirstRowID` set, no in-place mutation) or keep the helper unexported
within the package? wdyt?
##########
table/snapshot_producers.go:
##########
@@ -733,7 +733,20 @@ func (sp *snapshotProducer) manifestProducer(content
iceberg.ManifestContent, fi
}
defer internal.CheckedClose(wr, &err)
+ // For v3 data manifests, assign first_row_id to each data file.
+ // Each file claims a contiguous range of row IDs starting from
NextRowID.
+ assignRowIDs := sp.txn.meta.formatVersion >= 3 && content ==
iceberg.ManifestContentData
+ nextRowID := sp.txn.meta.NextRowID()
Review Comment:
Even if we keep the per-file assignment, this read happens once at
producer-build time and the values get baked into the data-manifest avro. On
OCC retry, `rebuildSnapshotUpdates` (`table/table.go`) only rewrites the
manifest *list*, not the data manifest entries — so under contention the
per-file IDs stay at the stale `NextRowID` from the first attempt while
metadata's `NextRowID` has moved.
`validateAndUpdateRowLineage` (`table/metadata.go:474-478`) would then
reject the retry deterministically. The recently-landed manifest-list rebuild
was designed assuming data manifests can be reused as-is on retry;
pre-assigning IDs breaks that invariant.
Worth a v3 OCC retry test either way — the existing `occ_scenario_test.go`
patterns should fit. Could we either move the assignment to commit time (after
retries settle) or scrap pre-assignment in favor of inheritance?
##########
table/snapshot_producers.go:
##########
@@ -733,7 +733,20 @@ func (sp *snapshotProducer) manifestProducer(content
iceberg.ManifestContent, fi
}
defer internal.CheckedClose(wr, &err)
+ // For v3 data manifests, assign first_row_id to each data file.
+ // Each file claims a contiguous range of row IDs starting from
NextRowID.
+ assignRowIDs := sp.txn.meta.formatVersion >= 3 && content ==
iceberg.ManifestContentData
Review Comment:
I think the bigger question is whether we should be assigning per-file
`first_row_id` at all on ADDED entries. The v3 spec ("First Row ID
Inheritance") says ADDED data files carry null `first_row_id`; the
manifest-list writer assigns the manifest-level value, and readers inherit
per-file IDs by summing preceding `record_count`s. Java's
`MergingSnapshotProducer.appendManifest` explicitly rejects inputs where
`manifest.firstRowId() != null`.
iceberg-go already implements both halves of the inheritance model — list
writer at `manifest.go:1419-1425`, reader at `manifest.go:734-742`. After this
PR, manifests we write would carry non-null per-file IDs that Java's
`appendManifest` would refuse to replay, and the per-file values become a
second source of truth alongside the manifest-level one.
Is there a reason we're diverging from inheritance here? Or could we drop
the per-file assignment and rely on what's already wired up?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]