amogh-jahagirdar commented on PR #10445: URL: https://github.com/apache/iceberg/pull/10445#issuecomment-2229624400
@szehon-ho @danielcweeks I compared the two PRs, and it seems @szehon-ho's PR is more about repairing manifest entry details from the actual data file on disk, so the two repair functions address somewhat different concerns. Here is what I'd propose (I think @szehon-ho is saying the same thing, but let me know if I'm misinterpreting):

1.) We take this PR forward to handle missing file references and duplicate files.
2.) Later on, we add an option to the procedure to repair manifest entry statistics from the actual data file on disk.

The second is a "deeper" repair IMO, in the sense that we actually need to read from disk. Arguably, though, the current implementation here still needs to do a disk read for the file existence check on missing files anyway. I think this will become clearer as we tighten up this implementation.

Another thing I'd propose is a `SupportsFileRecovery` mixin for `FileIO`. For example, for S3FileIO, if the bucket has versioning enabled, we could make a best-effort attempt to recover when a live manifest entry points to a file that, for whatever reason, no longer exists on disk.
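To make the `SupportsFileRecovery` idea a bit more concrete, here is a rough, self-contained sketch of how such a mixin could be probed during a repair. Everything in it is an assumption for illustration: `SimpleFileIO` is a toy stand-in rather than Iceberg's real `FileIO`, the `recoverFile(String)` signature is a guess at what the mixin might look like, and `VersionedInMemoryFileIO` only imitates a versioned bucket in memory instead of calling S3.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for FileIO; only the existence check matters here.
interface SimpleFileIO {
  boolean exists(String path);
}

// Hypothetical mixin named after the proposal; the signature is an assumption.
interface SupportsFileRecovery {
  // Best-effort restore; returns true if the path exists once the call returns.
  boolean recoverFile(String path);
}

// In-memory toy that imitates a versioned bucket: a delete keeps the prior version.
class VersionedInMemoryFileIO implements SimpleFileIO, SupportsFileRecovery {
  private final Map<String, byte[]> live = new HashMap<>();
  private final Map<String, byte[]> previousVersions = new HashMap<>();

  void write(String path, byte[] content) {
    live.put(path, content);
  }

  void delete(String path) {
    byte[] removed = live.remove(path);
    if (removed != null) {
      previousVersions.put(path, removed);
    }
  }

  @Override
  public boolean exists(String path) {
    return live.containsKey(path);
  }

  @Override
  public boolean recoverFile(String path) {
    if (live.containsKey(path)) {
      return true; // nothing to recover
    }
    byte[] prior = previousVersions.get(path);
    if (prior == null) {
      return false; // no older version available, recovery is not possible
    }
    live.put(path, prior);
    return true;
  }
}

class RepairSketch {
  // Returns true if the referenced file is present, attempting recovery
  // only when the IO implementation opts in through the mixin.
  static boolean ensurePresent(SimpleFileIO io, String path) {
    if (io.exists(path)) {
      return true;
    }
    if (io instanceof SupportsFileRecovery) {
      return ((SupportsFileRecovery) io).recoverFile(path);
    }
    return false;
  }
}
```

The point of the `instanceof` check is that the repair procedure would not need to know about S3 at all: implementations that cannot recover files simply don't implement the mixin, and the entry is then treated as a missing file reference.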