amogh-jahagirdar commented on PR #10784:
URL: https://github.com/apache/iceberg/pull/10784#issuecomment-2274937702

   @szehon-ho I think I agree that the use cases presented are legitimate 
repair cases however I'm not sure that we should be combining all of them into 
a single `RepairTable` procedure upfront. I understand the appeal but my 
thinking is that the implementation gets complex. My thinking is that users 
will typically hit some issue, like missing file referenced in their manifest 
or hit duplicate files and then just want to solve that issue with a focused 
procedure. Or in the other missing metadata file/missing manifest 
list/manifest/corrupted snapshot case run a separate procedure.
   
   That said one thing to consider is do we need these procedures to move in 
lockstep somehow to ensure that statistics/snapshot summaries roll up 
correctly? I don't think that's the case since in most cases we are producing 
new snapshots entirely after the repair and the deltas in rows/files 
added/deleted should be accumulated correctly? We'll probably need tests for 
that.
   
   I do think that we could add a `RepairTable` down the line which combines 
multiple smaller repair actions so that the implementation of `RepairTable` 
itself isn't super complex. It comes at the cost of having more in the action 
API surface area but I feel like that's OK?
   
   More concretely at least what I'm thinking so far:
   
   1. RepairManifests which is focused on looking at the table's current 
manifests and removing duplicates/attempting recovery of missing files or 
deleting those entries if those files cannot be recovered
   2. RepairTableMetadata which would do the historical metadata file recovery, 
manifest list repairs and snapshots?
   
   @szehon-ho I actually had a question on the snapshot repair, based on the 
description the goal of that is to repair snapshot summary stats which may have 
been corrupted. Doesn't that necessarily mean we must mutate the existing 
snapshot (and subsequent snapshots) to correct it?
   
   Overall though I'll def give the general `RepairTable` more thought, just 
feels a bit complex to put everything in one go up front rather than have 
focused actions.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to