vaultah commented on PR #13720: URL: https://github.com/apache/iceberg/pull/13720#issuecomment-3235133985
@dramaticlly @stevenzwu Let's say we have manifests A, B, and C, added by snapshot 1. We then create snapshot 2: its manifest list will contain references to manifests A, B, and C, plus a reference to the new manifest D.

Now we use `rewrite_table_path` in incremental mode, starting from snapshot 1. Per your suggestion, it will rewrite just manifest D and the manifest list of snapshot 2. We assume that the manifests added in snapshot 1 were already rewritten before, so we simply update their paths in the manifest list of snapshot 2. For manifest D, the rewritten manifest list will have the new path and the new length. For manifests A, B, and C, the rewritten manifest list will have the new paths and their original lengths. In other words, the rewritten manifest list will be

| manifest_path | manifest_length | ... |
| --- | --- | --- |
| newPath(D) | newLength(D) | |
| newPath(C) | oldLength(C) | |
| newPath(B) | oldLength(B) | |
| newPath(A) | oldLength(A) | |

Rewriting a manifest will almost certainly change its length, so in general `oldLength(A) != newLength(A)`. This means the length recorded for manifest A in the rewritten manifest list is incorrect: it doesn't match the length of the actual physical file at `newPath(A)` that the entry references. This is the scenario from https://github.com/apache/iceberg/issues/13719 that I'm trying to solve.

Please help me understand how correctness is maintained in your suggestion.
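To make the scenario above concrete, here is a minimal Python sketch that models it. It is not Iceberg code: `ManifestRef`, `new_path`, `new_length`, the bucket prefixes, and all byte sizes are hypothetical stand-ins, and the only assumption that matters is that a rewritten manifest differs in length from the original.

```python
from dataclasses import dataclass

# Hypothetical source and target prefixes, for illustration only.
OLD_PREFIX = "s3://old-bucket/tbl"
NEW_PREFIX = "s3://new-bucket/warehouse/tbl"


@dataclass(frozen=True)
class ManifestRef:
    """One manifest list entry, reduced to the two fields discussed above."""
    manifest_path: str
    manifest_length: int


def new_path(path: str) -> str:
    # Swap the source prefix for the target prefix.
    return NEW_PREFIX + path[len(OLD_PREFIX):]


def new_length(old_length: int) -> int:
    # Made-up size change: the rewritten manifest embeds longer file paths,
    # so its length differs from the original.
    return old_length + 4 * (len(NEW_PREFIX) - len(OLD_PREFIX))


# Manifest list of snapshot 2: D is new, A, B, C were added by snapshot 1.
snapshot_2 = [
    ManifestRef(f"{OLD_PREFIX}/metadata/{name}.avro", size)
    for name, size in [("D", 6100), ("C", 5900), ("B", 5800), ("A", 5700)]
]
added_after_snapshot_1 = {f"{OLD_PREFIX}/metadata/D.avro"}

# Physical files at the target: all four manifests were rewritten
# (A, B, C in an earlier run, D in this one), so all have new lengths.
target_files = {
    new_path(m.manifest_path): new_length(m.manifest_length) for m in snapshot_2
}


def rewrite_manifest_list(entries, rewritten_now):
    """Incremental rewrite: every path is updated, but the length is
    refreshed only for manifests rewritten in this run."""
    result = []
    for m in entries:
        if m.manifest_path in rewritten_now:
            length = new_length(m.manifest_length)
        else:
            length = m.manifest_length  # stale: carried over unchanged
        result.append(ManifestRef(new_path(m.manifest_path), length))
    return result


rewritten = rewrite_manifest_list(snapshot_2, added_after_snapshot_1)

# Entries whose recorded length disagrees with the file they point to.
stale = [
    m.manifest_path
    for m in rewritten
    if target_files[m.manifest_path] != m.manifest_length
]

for m in rewritten:
    status = "STALE" if m.manifest_path in stale else "ok"
    print(f"{m.manifest_path}  recorded={m.manifest_length}  "
          f"actual={target_files[m.manifest_path]}  {status}")
```

In this model the entries for A, B, and C are the ones flagged as stale, while D is consistent, which matches the table above.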
