rdblue commented on code in PR #12672:
URL: https://github.com/apache/iceberg/pull/12672#discussion_r2043087323


##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -283,11 +285,26 @@ public Snapshot apply() {
       throw new RuntimeIOException(e, "Failed to write manifest list file");
     }
 
-    Long addedRows = null;
-    Long lastRowId = null;
-    if (base.rowLineageEnabled()) {
-      addedRows = calculateAddedRows(manifests);
-      lastRowId = base.nextRowId();
+    Long assignedRows = null;
+    if (base.formatVersion() >= 3) {
+      assignedRows = writer.nextRowId() - base.nextRowId();

Review Comment:
   > I'm struggling now thinking about branching situations I think we may end 
up assigning different id's to the same rows if they exist in multiple 
branches. This may be ok though.
   
   This is correct: we will assign new row ID ranges for different branches 
when new commits are added to the branches. We don't know when rows are the 
same if metadata is missing, like when a table is compacted in one branch and 
then history ages off. There are also cases where the same file is added twice 
(through cherry picking) and _should_ get new row IDs.
   
   The safest thing for branching is to assign new row IDs for each branch. I 
think this is not a bad thing.
   * Branches can't be merged, only fast-forwarded. If you make a new commit in 
separate branches then you have removed the ability to fast-forward and the 
branches will remain separate.
   * Change history has to start when IDs are assigned. We can't really infer 
history before the table moved to v3. We don't have row-level IDs; the best we 
could do is infer the same set of IDs for data files, but that's tricky and 
probably not worth it because we don't have row-level changes anyway.
   
   > If we are ok with that we could just go to every snapshot without any 
parents. We treat all rows in that snapshot as having been added in that 
snapshot, this means all children of that snapshot start at least at the 
total-row count of the snapshot
   
   I agree that we could reconstruct row lineage metadata using a similar 
scheme. There are problems with this, though.
   
   First, it modifies existing snapshots. That would require a lot of extra 
work for the upgrade. The REST spec doesn't currently allow this other than by 
adding copies with different IDs and removing the old versions.
   
   Second, even if we were okay with modifying existing snapshots (or producing 
a completely new set of them) we would be creating situations where table 
metadata is misleading because we've filled in IDs with imperfect information. 
I would prefer not assigning IDs to snapshots that never had them because that 
is correctly reporting what we know. If we fill in the history, we still don't 
get row-level tracking in old commits because the `_row_id` field wasn't 
written to new data files.
   
   I think the argument against filling in historical row IDs is that it is 
extra work to think through the issues and build an algorithm to modify old 
data. And when we've done that extra work, we don't actually gain anything: you 
still can't reason about the IDs of records between branches or even within a 
branch.
   
   My conclusion is that we should just expose the data that we know, which is 
that the row ID was null until the row was assigned an ID in a branch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to