tanmayrauth commented on issue #999: URL: https://github.com/apache/iceberg-go/issues/999#issuecomment-4462632884
Here's my thinking on the three open questions. Cross-referenced Java's
implementation of Spark/Flink write paths.
---
1. Where the rewrite path reads source _row_id
I'd go with scanner projection rather than re-reading source Parquet
separately.
During a rewrite, we add _row_id and _last_updated_sequence_number to the
scan's projected schema. The scanner already synthesizes these from file
metadata (first_row_id + position). The writer then stores the values
explicitly in output Parquet instead of leaving them null.
Java does the same: SparkTable adds lineage columns to the read schema
when isTableRewrite=true. The reader transparently handles both cases: files
that already have an explicit _row_id column (from a previous rewrite) use the
stored value; files without it get synthesis. Our synthesizeRowLineageColumns
already implements this dual path.
The alternative (re-reading source Parquet outside the scanner) would add
a second I/O pass, duplicate synthesis logic, and break on multi-rewrite files
that already have explicit _row_id. I don't see a reason to diverge from Java
here.
I'm thinking a WithPreserveRowLineage() writer option; the change lives
entirely at the scan+write layer, and the snapshot producer stays unaware.
---
2. next-row-id accounting for rewrites
@laskoviymishka I believe the current logic is already correct and no
change is needed.
Today addedRows = *writer.NextRowID() - firstRowID counts all rows in new
manifests, including preserved survivors. This "wastes" ID space but doesn't
violate uniqueness; the actual row IDs seen at query time come from the
explicit Parquet column, not the global counter.
Java does the same. ManifestListWriter.V4Writer advances nextRowId by the
full manifest row count regardless of whether rows inside preserve old IDs. The
ID space is int64, so waste isn't a practical concern.
I'd add a test + comment documenting this is intentional rather than
changing the accounting.
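To make the claim concrete, here's a minimal sketch (illustrative names, not the real API) of the accounting: the counter advances by the full manifest row count, survivors included, and the "wasted" range causes no collision because readers take IDs from the explicit column.

```go
package main

import "fmt"

// advanceNextRowID mirrors addedRows = *writer.NextRowID() - firstRowID:
// the counter moves by every row written to the new manifests, whether
// or not those rows preserve an old explicit _row_id. Hypothetical
// sketch, not the iceberg-go implementation.
func advanceNextRowID(firstRowID, manifestRows int64) (addedRows, nextRowID int64) {
	addedRows = manifestRows
	nextRowID = firstRowID + addedRows
	return
}

func main() {
	// 800 survivors keep their old explicit IDs and 200 rows are
	// fresh; the counter still advances by all 1000.
	added, next := advanceNextRowID(1000, 800+200)
	fmt.Println(added, next) // 1000 2000
	// IDs [1000, 1800) go unused by any live row, which is harmless
	// in an int64 ID space: no two live rows ever share an ID.
}
```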
---
3. MoR equality-rewrite: which rows get fresh IDs?
For pure compaction (bin-pack), I don't think we need a distinction: lineage
columns flow through as regular data from the scan projection.
For MoR UPDATEs (position-delta path), the writer needs to know intent.
I'm thinking an insert vs reinsert split on the writer API:
- Reinsert(record, rowID, seqNum) - survivor row; preserves _row_id and
nulls _last_updated_sequence_number so it inherits the new file's
data_sequence_number
- Insert(record) - fresh row; both columns null, gets new identity at
read time
Java uses this exact pattern: SparkPositionDeltaWrite.reinsert(meta, row)
vs insert(row).
Worth noting: equality deletes cannot preserve lineage per spec; the engine
writes without reading old identity. Only pos-delete rewrites and CoW can
preserve.
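Here's how the split could look on the writer side. All names are hypothetical (modeled loosely on Java's SparkPositionDeltaWrite, and simplified to omit the seqNum argument): Reinsert carries the survivor's _row_id forward with a null sequence number, Insert leaves both columns null for fresh identity at read time.

```go
package main

import "fmt"

// row models the two lineage columns: nil rowID means "assign at read
// time", nil updateSeq means "inherit the file's data_sequence_number".
type row struct {
	data      string
	rowID     *int64 // explicit _row_id
	updateSeq *int64 // _last_updated_sequence_number
}

// deltaWriter is a hypothetical sketch of the proposed API.
type deltaWriter struct{ out []row }

// Reinsert writes a survivor: _row_id is preserved, the sequence number
// is deliberately left null so the new file's data_sequence_number applies.
func (w *deltaWriter) Reinsert(data string, rowID int64) {
	w.out = append(w.out, row{data: data, rowID: &rowID})
}

// Insert writes a brand-new row with no lineage: both columns null.
func (w *deltaWriter) Insert(data string) {
	w.out = append(w.out, row{data: data})
}

func main() {
	var w deltaWriter
	w.Reinsert("updated row", 42) // survivor keeps _row_id 42
	w.Insert("new row")           // fresh identity at read time

	for _, r := range w.out {
		if r.rowID != nil {
			fmt.Println(r.data, *r.rowID)
		} else {
			fmt.Println(r.data, "fresh")
		}
	}
}
```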
---
PR sequence (proposed)
1. CoW rewrite preserves IDs: lineage columns in compaction scan,
explicit write in output Parquet
2. Accounting validation: test + comment confirming overcounting is
correct
3. MoR equality-rewrite: insert/reinsert writer API
@zeroshade @laskoviymishka Let me know if any of this feels off or if
you'd approach something differently.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
