amogh-jahagirdar commented on PR #10445:
URL: https://github.com/apache/iceberg/pull/10445#issuecomment-2229624400

   @szehon-ho @danielcweeks I compared the two PRs, and it looks like 
@szehon-ho's PR is more about repairing manifest entry details from the actual 
data file on disk, so the two repair functions address somewhat different 
concerns. Here is what I'd propose (I think @szehon-ho is saying the same 
thing, but let me know if I'm misinterpreting):
   
   1.) We take this PR forward to handle missing file references and duplicate 
files.
   2.) Later on, we add an option to the procedure to repair manifest entry 
statistics from the actual data file on disk. This is a "deeper" repair imo, 
in the sense that we actually need to read from disk. Although arguably the 
current implementation still needs a disk read anyway, for the file existence 
check on missing files? I think it'll become clear as we tighten up this 
implementation.
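
   To make item 1 concrete, here is a rough sketch of the kind of scan I have 
in mind. It is not the PR's implementation: `ManifestRepairSketch`, `Report`, 
and `scan` are hypothetical names, and the existence check is passed in as a 
plain predicate rather than going through any real `FileIO` call.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names come from the Iceberg API. */
class ManifestRepairSketch {

  /** Paths that need attention after one pass over the live entries. */
  static final class Report {
    final List<String> missing = new ArrayList<>();
    final List<String> duplicates = new ArrayList<>();
  }

  /**
   * Scans the data file paths referenced by live manifest entries.
   * A path referenced more than once is reported once as a duplicate;
   * a path for which the existence check fails is reported as missing.
   */
  static Report scan(List<String> liveFilePaths, Predicate<String> exists) {
    Report report = new Report();
    Set<String> seen = new HashSet<>();
    for (String path : liveFilePaths) {
      if (!seen.add(path)) {
        // Second or later reference to the same file.
        if (!report.duplicates.contains(path)) {
          report.duplicates.add(path);
        }
        continue;
      }
      // Existence is checked once per distinct path; this is the storage call.
      if (!exists.test(path)) {
        report.missing.add(path);
      }
    }
    return report;
  }
}
```

   Note that even this "shallow" pass makes one storage call per distinct 
path, which is the disk-read point raised in item 2.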
   
   Another aspect I'd propose is a `SupportsFileRecovery` mixin for `FileIO`. 
For example, for S3FileIO, if the bucket has versioning enabled, we could make 
a best-effort attempt to recover a file when a live manifest entry points to 
one that, for whatever reason, no longer exists on disk. 
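
   A minimal sketch of how such a mixin could be shaped, under stated 
assumptions: `SimpleFileIO` is a stand-in for `FileIO`, the `recoverFile` 
method name and signature are my guess at an interface, and 
`VersionedInMemoryIO` only imitates a versioned bucket in memory. Nothing here 
calls S3 or the real Iceberg classes.

```java
import java.util.HashSet;
import java.util.Set;

/** Stand-in for a file IO abstraction (hypothetical, not Iceberg's FileIO). */
interface SimpleFileIO {
  boolean exists(String path);
}

/** Hypothetical mixin: an implementation may be able to bring a deleted file back. */
interface SupportsFileRecovery {
  /** Best effort; returns true only if the file exists after the attempt. */
  boolean recoverFile(String path);
}

/** In-memory imitation of a versioned bucket that retains deleted objects. */
class VersionedInMemoryIO implements SimpleFileIO, SupportsFileRecovery {
  private final Set<String> live = new HashSet<>();
  private final Set<String> previousVersions = new HashSet<>();

  void put(String path) {
    live.add(path);
  }

  void delete(String path) {
    // A versioned store keeps the old version around after a delete.
    if (live.remove(path)) {
      previousVersions.add(path);
    }
  }

  @Override
  public boolean exists(String path) {
    return live.contains(path);
  }

  @Override
  public boolean recoverFile(String path) {
    if (live.contains(path)) {
      return true;
    }
    if (previousVersions.remove(path)) {
      live.add(path); // restore the retained version
      return true;
    }
    return false; // nothing retained, so nothing to recover
  }
}

class FileRecoverySketch {
  /** Tries recovery only when the IO opts in through the mixin. */
  static boolean ensureExists(SimpleFileIO io, String path) {
    if (io.exists(path)) {
      return true;
    }
    if (io instanceof SupportsFileRecovery) {
      return ((SupportsFileRecovery) io).recoverFile(path);
    }
    return false;
  }
}
```

   The repair procedure would only need the `instanceof` check, so `FileIO` 
implementations without any recovery story stay unchanged.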


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

