amogh-jahagirdar commented on PR #10784: URL: https://github.com/apache/iceberg/pull/10784#issuecomment-2274937702
@szehon-ho I think I agree that the use cases presented are legitimate repair cases however I'm not sure that we should be combining all of them into a single `RepairTable` procedure upfront. I understand the appeal but my thinking is that the implementation gets complex. My thinking is that users will typically hit some issue, like missing file referenced in their manifest or hit duplicate files and then just want to solve that issue with a focused procedure. Or in the other missing metadata file/missing manifest list/manifest/corrupted snapshot case run a separate procedure. That said one thing to consider is do we need these procedures to move in lockstep somehow to ensure that statistics/snapshot summaries roll up correctly? I don't think that's the case since in most cases we are producing new snapshots entirely after the repair and the deltas in rows/files added/deleted should be accumulated correctly? We'll probably need tests for that. I do think that we could add a `RepairTable` down the line which combines multiple smaller repair actions so that the implementation of `RepairTable` itself isn't super complex. It comes at the cost of having more in the action API surface area but I feel like that's OK? More concretely at least what I'm thinking so far: 1. RepairManifests which is focused on looking at the table's current manifests and removing duplicates/attempting recovery of missing files or deleting those entries if those files cannot be recovered 2. RepairTableMetadata which would do the historical metadata file recovery, manifest list repairs and snapshots? @szehon-ho I actually had a question on the snapshot repair, based on the description the goal of that is to repair snapshot summary stats which may have been corrupted. Doesn't that necessarily mean we must mutate the existing snapshot (and subsequent snapshots) to correct it? Overall though I'll def give the general `RepairTable` more thought, just feels a bit complex to put everything in one go up front rather than have focused actions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org