dramaticlly commented on PR #13720:
URL: https://github.com/apache/iceberg/pull/13720#issuecomment-3235150003

   > @dramaticlly @stevenzwu
   > 
   > Let's say we have manifests A, B, and C, added by snapshot 1. We then 
create snapshot 2: its manifest list will contain references to manifests A, B, 
and C, and we also add the reference to the new manifest D.
   > 
   > Now we use `rewrite_table_path` in incremental mode, starting from 
snapshot 1.
   > 
   > Per your suggestion, it will rewrite just manifest D and the manifest list 
of snapshot 2. We assume that manifests added in snapshot 1 were already 
rewritten before, so we simply update their paths in the manifest list of 
snapshot 2.
   > 
   > For manifest D, the rewritten manifest list will have the new path and the 
new size. For manifests A, B, C, the rewritten manifest list will have new 
paths and their original lengths. In other words, the rewritten manifest list 
will be
   > 
   > | manifest_path | manifest_length | ... |
   > |---|---|---|
   > | newPath(D) | newLength(D) | |
   > | newPath(C) | oldLength(C) | |
   > | newPath(B) | oldLength(B) | |
   > | newPath(A) | oldLength(A) | |
   > 
   > As a result of rewriting, manifest length will almost certainly change, so 
in general `oldLength(A) != newLength(A)`, which means the size of manifest A 
in the rewritten manifest list is incorrect, as in it doesn't match the length 
of the actual physical file at `newPath(A)` that it's referencing. This is the 
scenario from #13719 that I'm trying to solve.
   > 
   > Please help me understand how the correctness is maintained in your 
suggestion
   
   I think we are trying to do two things at the same time:
   1. Incremental copy between the existing and new version files. Say 
snapshot 2 produced a new manifest D, and manifests A/B/C already exist in the 
target table before this incremental copy. We want to rewrite the metadata 
files needed to copy all the incremental files over from source to target.
   2. Fix the sizes of the existing manifests A/B/C in the latest manifest 
list (that of snapshot 2).
   
   As of now, we are trying to rewrite manifests A/B/C/D as part of this 
SparkAction, but only manifest D is strictly required if sizing is not a 
problem. 
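
   A minimal sketch of why sizing becomes a problem when only manifest D is 
rewritten. It is a toy model, not Iceberg code: `ManifestEntry`, 
`incremental_rewrite`, `stale_entries`, the paths, and all byte sizes are 
hypothetical and chosen only to mirror the A/B/C/D scenario quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """One row of a manifest list (toy model, only two of the fields)."""
    manifest_path: str
    manifest_length: int


def new_path(path: str, source_prefix: str, target_prefix: str) -> str:
    """Swap the source table prefix for the target prefix."""
    if not path.startswith(source_prefix):
        raise ValueError(f"{path} is not under {source_prefix}")
    return target_prefix + path[len(source_prefix):]


def incremental_rewrite(entries, added_paths, source_prefix, target_prefix,
                        target_lengths):
    """Rewrite a manifest list, re-measuring only the newly added manifests.

    Manifests outside `added_paths` are assumed to have been rewritten by an
    earlier run, so only their path is updated and the source length is kept.
    """
    rewritten = []
    for entry in entries:
        path = new_path(entry.manifest_path, source_prefix, target_prefix)
        if entry.manifest_path in added_paths:
            length = target_lengths[path]      # newLength(D)
        else:
            length = entry.manifest_length     # oldLength(A/B/C), carried over
        rewritten.append(ManifestEntry(path, length))
    return rewritten


def stale_entries(entries, target_lengths):
    """Return paths whose recorded length differs from the file at the target."""
    return [e.manifest_path for e in entries
            if e.manifest_length != target_lengths[e.manifest_path]]


SRC, DST = "s3://src/tbl", "s3://dst/tbl"

# Manifest list of snapshot 2 in the source table (made-up sizes).
snapshot2 = [ManifestEntry(f"{SRC}/metadata/{name}.avro", size)
             for name, size in [("D", 400), ("C", 300), ("B", 200), ("A", 100)]]

# Lengths of the rewritten manifest files at the target (made-up sizes that
# differ from the source because the embedded paths changed).
target_lengths = {f"{DST}/metadata/{name}.avro": size
                  for name, size in [("A", 104), ("B", 204), ("C", 304), ("D", 404)]}

rewritten = incremental_rewrite(
    snapshot2, {f"{SRC}/metadata/D.avro"}, SRC, DST, target_lengths)
stale = stale_entries(rewritten, target_lengths)
```

   In this model `stale` lists the target paths of C, B, and A, while D 
matches its physical file, which is the mismatch described in #13719.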
   
   Personally, I think that to properly fix all historical manifest sizes in 
the target table, we might need more than just an incremental copy, so my 
recommendation is to do a complete (non-incremental) rewrite as a one-time 
fix. All future rewrites of the table path afterward will be delta based.
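
   The "one full rewrite, then deltas" flow could look roughly like the 
sketch below, which only builds the SQL for the two kinds of call. The 
parameter names (`table`, `source_prefix`, `target_prefix`, `start_version`) 
are assumptions based on the Iceberg Spark procedure documentation as I 
recall it, and the catalog, table, prefixes, and metadata file name are 
hypothetical; please check them against the Iceberg version in use.

```python
from typing import Optional


def rewrite_table_path_sql(catalog: str, table: str, source_prefix: str,
                           target_prefix: str,
                           start_version: Optional[str] = None) -> str:
    """Build a CALL statement for the rewrite_table_path procedure.

    Omitting start_version requests a complete rewrite; passing it requests
    an incremental rewrite starting from that metadata version.
    Parameter names are assumed, and values are not escaped (sketch only).
    """
    args = [
        ("table", table),
        ("source_prefix", source_prefix),
        ("target_prefix", target_prefix),
    ]
    if start_version is not None:
        args.append(("start_version", start_version))
    rendered = ", ".join(f"{name} => '{value}'" for name, value in args)
    return f"CALL {catalog}.system.rewrite_table_path({rendered})"


# One-time complete rewrite, intended to fix all historical manifest sizes.
full_sql = rewrite_table_path_sql(
    "my_catalog", "db.tbl", "s3://src/tbl", "s3://dst/tbl")

# Later runs are delta based, starting from the last version already copied.
delta_sql = rewrite_table_path_sql(
    "my_catalog", "db.tbl", "s3://src/tbl", "s3://dst/tbl",
    start_version="v2.metadata.json")
```

   With a Spark session available, each string would be passed to 
`spark.sql(...)`; the sketch stops at building the statement so it can be 
inspected without a cluster.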
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

