rdblue commented on code in PR #12672:
URL: https://github.com/apache/iceberg/pull/12672#discussion_r2043087323
##########
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##########
@@ -283,11 +285,26 @@ public Snapshot apply() {
       throw new RuntimeIOException(e, "Failed to write manifest list file");
     }

-    Long addedRows = null;
-    Long lastRowId = null;
-    if (base.rowLineageEnabled()) {
-      addedRows = calculateAddedRows(manifests);
-      lastRowId = base.nextRowId();
+    Long assignedRows = null;
+    if (base.formatVersion() >= 3) {
+      assignedRows = writer.nextRowId() - base.nextRowId();

Review Comment:

> I'm struggling now thinking about branching situations. I think we may end up assigning different IDs to the same rows if they exist in multiple branches. This may be ok though.

This is correct: we will assign new row ID ranges for different branches when new commits are added to those branches. We don't know when rows are the same if metadata is missing, for example when a table is compacted in one branch and then history ages off. There are also cases where the same file is added twice (through cherry-picking) and _should_ get new row IDs. The safest thing for branching is to assign new row IDs for each branch.

I don't think this is a bad thing:
* Branches can't be merged, only fast-forwarded. If you make new commits in separate branches, you have already given up the ability to fast-forward and the branches will remain separate.
* Change history has to start when IDs are assigned. We can't really infer history from before the table moved to v3. We don't have row-level IDs; the best we could do is infer the same set of IDs for data files, but that's tricky and probably not worth it because we don't have row-level changes anyway.

> If we are ok with that we could just go to every snapshot without any parents. We treat all rows in that snapshot as having been added in that snapshot; this means all children of that snapshot start at least at the total row count of the snapshot.

I agree that we could reconstruct row lineage metadata using a similar scheme. There are problems with this, though.

First, it modifies existing snapshots. That would require a lot of extra work for the upgrade, and the REST spec doesn't currently allow it other than by adding copies with different IDs and removing the old versions.

Second, even if we were okay with modifying existing snapshots (or producing a completely new set of them), we would be creating situations where table metadata is misleading because we've filled in IDs with imperfect information. I would prefer not assigning IDs to snapshots that never had them, because that correctly reports what we know. Even if we fill in the history, we still don't get row-level tracking in old commits because the `_row_id` field wasn't written to the data files.

I think the argument against filling in historical row IDs is that it is extra work to think through the issues and build an algorithm to modify old data, and when we've done that extra work we don't actually gain anything: you still can't reason about the IDs of records between branches, or even within a branch.

My conclusion is that we should just expose the data that we know: the row ID was null until the row was assigned an ID in a branch.
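
To make the allocation arithmetic in the diff concrete, here is a minimal, self-contained sketch. It is not Iceberg's actual API; `RowIdAllocator` and `BranchingExample` are hypothetical names. It only illustrates how a single table-level next-row-id counter hands out a contiguous ID range to each commit, mirroring `assignedRows = writer.nextRowId() - base.nextRowId()`, and why independent commits on different branches end up with different ID ranges even for the same logical rows.

```java
// A minimal sketch, not Iceberg's API: RowIdAllocator stands in for the table's
// next-row-id counter; each commit claims the contiguous range
// [firstRowId, firstRowId + assignedRows), mirroring
// assignedRows = writer.nextRowId() - base.nextRowId() in the diff above.
class RowIdAllocator {
  private long nextRowId = 0; // plays the role of base.nextRowId()

  // Claims a range for one commit and returns its first-row-id.
  long assign(long rowsInCommit) {
    long firstRowId = nextRowId;   // base.nextRowId() at the start of the commit
    nextRowId += rowsInCommit;     // writer.nextRowId() after writing the commit
    return firstRowId;
  }
}

public class BranchingExample {
  public static void main(String[] args) {
    RowIdAllocator table = new RowIdAllocator();

    // main: commit A adds 100 rows -> row IDs [0, 100)
    long a = table.assign(100);

    // branch b1: commit B compacts the same 100 rows into a new file.
    // The rewritten rows get a fresh range -> [100, 200), even though they
    // are logically the same rows that commit A added.
    long b = table.assign(100);

    // branch b2: commit C adds the same file again (e.g. via cherry-pick).
    // It also gets a fresh range -> [200, 300).
    long c = table.assign(100);

    System.out.printf("A: [%d, %d)%n", a, a + 100);
    System.out.printf("B: [%d, %d)%n", b, b + 100);
    System.out.printf("C: [%d, %d)%n", c, c + 100);
  }
}
```

The point of the sketch is that ranges are claimed per commit from one counter, so once branches commit independently their rows land in disjoint ranges even when the underlying data is identical, which is the behavior the comment above argues is acceptable.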