amogh-jahagirdar commented on PR #10784: URL: https://github.com/apache/iceberg/pull/10784#issuecomment-2623396968
> For my edification, can someone please explain how duplicate file entries in manifests can arise? Can two entries for the same file occur in a single manifest? Can even two manifests be in the same manifest list if they overlap (have an entry for the same file in common)? I'd have thought both of these situations would be bugs. Or are there actual sequences of operations that lead to such outcomes, similar to how dangling deletes can occur?

Sure, so one example of an issue that happened in the past is that the Kafka Connect integration ended up appending the same file multiple times. We rectified that in Kafka Connect, and in the Iceberg library for duplicate appends in the *same* snapshot, but it's still technically possible to append the same file across different snapshots (at least in the reference implementation and probably a few others). Detecting overlapping files requires reading through the current manifest(s) to deduplicate, which would be prohibitively expensive if performed on every append.

Overlapping files across manifests do imply a bug in whatever integration is writing to the table. However, even after those bugs are fixed, it still makes sense for Iceberg to expose repair procedures that correct the affected tables, so users are unblocked from using them. More generally, imo it makes sense to offer a general `RepairTable` procedure with different options that let users correct their tables as best as possible in case a bad implementation ended up writing to them.

> Also, I understand that there was an old bug where data file size was written incorrectly and this actually caused reads to fail, and this is the motivation for correcting the statistics in metadata. However, that bug was long fixed, so I wonder if there are still known situations where these statistics need to be corrected.
Yeah, I think the same point I mentioned above applies: some implementation may end up writing incorrect statistics for whatever reason, and it'd be good for repair table to correct that, since stats are something that can be deterministically corrected.
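
To make the duplicate-entry case concrete, here is a minimal sketch of the kind of scan a repair step would need. It does not use the Iceberg API: the `manifests` dict (manifest name mapped to the data file paths it records), the file names, and `find_duplicate_files` are all hypothetical stand-ins for illustration. The point it shows is that the check has to visit every entry of every live manifest, so its cost grows with the size of the table rather than the size of the append.

```python
from collections import defaultdict


def find_duplicate_files(manifests):
    """Return {file_path: [manifest names]} for paths recorded more than once.

    `manifests` is a hypothetical stand-in for real manifest contents:
    a dict mapping a manifest name to the list of data file paths in it.
    """
    seen = defaultdict(list)
    # Every entry of every manifest is visited once: O(total entries).
    for manifest_name, file_paths in manifests.items():
        for path in file_paths:
            seen[path].append(manifest_name)
    # Keep only paths that appear more than once, whether within a single
    # manifest or across several manifests.
    return {path: names for path, names in seen.items() if len(names) > 1}


# Made-up example data: b.parquet overlaps across two manifests,
# d.parquet is listed twice inside one manifest.
manifests = {
    "m1.avro": ["data/a.parquet", "data/b.parquet"],
    "m2.avro": ["data/b.parquet", "data/c.parquet"],
    "m3.avro": ["data/d.parquet", "data/d.parquet"],
}
duplicates = find_duplicate_files(manifests)
```

A repair option built along these lines could then decide which entry to keep for each duplicated path; that policy is left out here on purpose.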