fuzing opened a new issue, #12883: URL: https://github.com/apache/iceberg/issues/12883
### Feature Request / Improvement

At the moment, once a data file goes missing or becomes corrupted, table functionality is diminished or lost entirely because of cascading errors caused by the missing or corrupted files (the exact behavior depends on the query engine, among other things).

In the event of data file loss or corruption, it would be useful to have a procedure that generates a new snapshot and metadata excluding the missing or corrupted file(s), and reports which files were excluded. This procedure might be extended to cover cases where metadata and/or snapshot files are corrupted (with varying degrees of rebuild success depending on the damage).

One could imagine multiple strategies for such a tool, e.g.:

- Perform a simple data file existence check and exclude the files that are missing (cheap, because data files don't need to be read)
- Perform a complete sanity check of the table structure (expensive, as each data file would need to be decompressed/ingested and checked for integrity)
- etc.

Similar to other Spark procedures, this one might have a "dry_run" flag, so that issues are identified and the repair plan is laid out before any repair is initiated.

### Query engine

Spark

### Willingness to contribute

- [ ] I can contribute this improvement/feature independently
- [ ] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@iceberg.apache.org
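
As a rough illustration of the cheap existence-check strategy with a dry-run mode described in this issue, here is a minimal Python sketch. It is not an Iceberg API: `plan_repair`, `RepairPlan`, and the `exists` and `commit` callbacks are hypothetical names invented for this example. The list of data file paths is assumed to have been obtained elsewhere (for instance from the table's `files` metadata table), and `commit` stands in for whatever engine-specific step would write a new snapshot without the missing files.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class RepairPlan:
    """Result of the existence check (hypothetical structure, not an Iceberg type)."""
    kept: List[str] = field(default_factory=list)
    missing: List[str] = field(default_factory=list)
    applied: bool = False


def plan_repair(
    data_files: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
    dry_run: bool = True,
    commit: Optional[Callable[[List[str]], None]] = None,
) -> RepairPlan:
    """Split data files into present and missing without reading their contents.

    With dry_run=True (the default) the plan is only reported.
    With dry_run=False, the missing paths are handed to `commit`, a
    placeholder for the step that would create the new snapshot.
    """
    plan = RepairPlan()
    for path in data_files:
        # Existence check only: cheap, but it cannot detect corrupted files.
        (plan.kept if exists(path) else plan.missing).append(path)

    if not dry_run and plan.missing and commit is not None:
        commit(plan.missing)
        plan.applied = True
    return plan
```

Calling `plan_repair(paths)` returns the list of files that would be excluded without changing anything. Passing `dry_run=False` together with a `commit` callback applies the plan. The default `exists` check only works for local paths, so an object store would need its own callback. The more expensive integrity-check strategy would replace `exists` with a callback that actually opens and validates each file.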