stevenzwu commented on code in PR #16689:
URL: https://github.com/apache/iceberg/pull/16689#discussion_r3383148027
##########
core/src/main/java/org/apache/iceberg/TrackingBuilder.java:
##########
@@ -101,33 +100,35 @@ private TrackingBuilder(Tracking source, long
newSnapshotId) {
/** Indicates that the DV has been updated for the new Tracking. */
Review Comment:
All three mutators below — `dvUpdated`, `deletedPositions`,
`replacedPositions` — now drive the status machine implicitly (EXISTING →
MODIFIED, ADDED stays ADDED) and unconditionally advance `dvSnapshotId` to
`newSnapshotId`. None of that shows up in their Javadoc, and the same-commit
`ADDED` carve-out is the kind of rule a reader has to reverse-engineer from the
constructor + the if-guard. Worth spelling out on each method, e.g.:
```java
/**
* Marks the DV as updated by this commit, advancing {@code dvSnapshotId} to
the new commit
* snapshot. An EXISTING entry transitions to MODIFIED; an ADDED entry stays
ADDED (same-commit
* append + DV attach).
*/
TrackingBuilder dvUpdated() { ... }
/**
* Records manifest-leaf positions where data files were deleted by this
commit. Advances
* {@code dvSnapshotId} to the new commit snapshot and transitions an
EXISTING entry to MODIFIED.
* Cannot be called on an ADDED entry.
*/
TrackingBuilder deletedPositions(ByteBuffer positions) { ... }
```
Same shape for `replacedPositions`. Without this, the only place the `ADDED`
carve-out is documented is the test name.
##########
core/src/main/java/org/apache/iceberg/Tracking.java:
##########
@@ -28,13 +28,13 @@ interface Tracking {
0,
"status",
Types.IntegerType.get(),
- "Entry status: 0=existing, 1=added, 2=deleted, 3=replaced");
+ "Entry status: 0=existing, 1=added, 2=deleted, 3=replaced,
4=modified");
Types.NestedField SNAPSHOT_ID =
Types.NestedField.optional(
1,
"snapshot_id",
Types.LongType.get(),
- "Snapshot ID where the file was added or deleted");
+ "Snapshot ID where the file was added, deleted, replaced, or
modified");
Review Comment:
Description was extended to include `or modified`, but
`testDvUpdatedProducesModifiedAndAdvancesDvSnapshotId` and
`testCarryForwardFromModifiedSourceChangesToExisting` both pin `MODIFIED`
entries to `source.snapshotId()` (the original add commit) — only
`dv_snapshot_id` advances on a modify. The doc here doesn't match what
`TrackingBuilder` produces.
```suggestion
"Snapshot ID where the file was added, deleted, or replaced");
```
Orthogonal to rdblue's open question above on REPLACED: regardless of how
that lands, `MODIFIED` should come out of this description because it never
advances `snapshot_id`.
##########
core/src/test/java/org/apache/iceberg/TestTrackingStruct.java:
##########
@@ -433,6 +408,38 @@ void testExistingToTerminalTransitions() {
assertThat(replaced.snapshotId()).isEqualTo(999L);
}
+ @Test
+ void testExistingPreservesSourceSnapshotId() {
+ Tracking source = sourceTracking();
+ Tracking existing = TrackingBuilder.from(source, 999L).build();
+ assertThat(existing.status()).isEqualTo(EntryStatus.EXISTING);
+
assertThat(existing.snapshotId()).isEqualTo(source.snapshotId()).isNotEqualTo(999L);
+ }
+
+ @Test
+ void testCarryForwardFromModifiedSourceChangesToExisting() {
+ // A MODIFIED entry from a prior commit carried forward without mutation
becomes EXISTING,
+ // preserving the snapshot id from the modify commit.
+ Tracking modifiedSource = sourceTrackingWithStatus(EntryStatus.MODIFIED);
+ Tracking carried = TrackingBuilder.from(modifiedSource, 999L).build();
+ assertThat(carried.status()).isEqualTo(EntryStatus.EXISTING);
+
assertThat(carried.snapshotId()).isEqualTo(modifiedSource.snapshotId()).isNotEqualTo(999L);
Review Comment:
Two issues here:
1. The inline comment on line 421-422 is inverted: `snapshotId` on a
`MODIFIED` source is the *original add* commit (42 in `sourceTracking()`), not
the modify commit. The modify commit lives in `dvSnapshotId` (43). Suggested
rewording: `preserving the snapshot id of the original add`.
2. The test asserts only `status` and `snapshotId`. To actually catch a
regression that drops the modify-commit pointer (`dvSnapshotId`) when a
`MODIFIED` source is carried forward, the test should also assert
`dvSnapshotId`, `dataSequenceNumber`, `fileSequenceNumber`, and `firstRowId`
are preserved from the source.
##########
core/src/main/java/org/apache/iceberg/TrackingBuilder.java:
##########
@@ -101,33 +100,35 @@ private TrackingBuilder(Tracking source, long
newSnapshotId) {
/** Indicates that the DV has been updated for the new Tracking. */
TrackingBuilder dvUpdated() {
- // DV applies to data files; deleted/replaced positions apply to manifest
files
- Preconditions.checkState(
- deletedPositions == null && replacedPositions == null,
- "Cannot mark DV updated on a manifest entry (deleted/replaced
positions are set)");
this.dvSnapshotId = newSnapshotId;
Review Comment:
The last commit dropped two cross-mutator preconditions and the test that
exercised them (`testDvSnapshotIdAndManifestDVPositionsAreMutuallyExclusive`):
- `dvUpdated` no longer rejects entries that already have
`deletedPositions`/`replacedPositions` set.
- `deletedPositions`/`replacedPositions` no longer reject entries that
already have `dvSnapshotId` set.
Was that intentional? With both gone, a caller can silently mix data-file
mutations (`dvUpdated`) and manifest-leaf mutations (positions) on the same
builder and get a `Tracking` that's neither a clean data-file entry nor a clean
manifest leaf. The new model where the position mutators themselves advance
`dvSnapshotId` makes the second check redundant on its own, but the first still
enforced a useful "this is a data-file path" invariant.
##########
core/src/test/java/org/apache/iceberg/TestTrackingStruct.java:
##########
Review Comment:
nit: this test is the only place that pins the same-commit append + DV
attach contract — status stays `ADDED`, not promoted to `MODIFIED`. The name
reads as a generic builder smoke test; something like
`testAddedWithSameCommitDvStaysAdded` would surface the carve-out from the test
list.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]