amogh-jahagirdar commented on PR #10784:
URL: https://github.com/apache/iceberg/pull/10784#issuecomment-2623396968

    > For my edification, can someone please explain how duplicate file entries 
in manifests can arise? Can two entries for the same file occur in a single 
manifest? Can even two manifests be in the same manifest list if they overlap 
(have an entry for the same file in common)? I'd have thought both of these 
situations would be bugs. Or are there actual sequences of operations that lead 
to such outcomes, similar to how dangling deletes can occur?
   
   Sure. One example of an issue that happened in the past is that the 
Kafka Connect integration ended up appending the same file multiple times. 
We fixed that in both Kafka Connect and the Iceberg library for duplicate 
appends in the *same* snapshot, but it's still technically possible to append 
the same file across different snapshots (at least in the reference 
implementation and probably a few others). 
   
   Detecting overlapping files requires reading through the current 
manifest(s) to deduplicate, which would be prohibitively expensive if 
performed on every append. 
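
   To make the cost concrete, here is a rough sketch of the kind of scan a 
duplicate check would need. This is not the Iceberg implementation: the 
`manifests` mapping and the `find_duplicate_entries` helper are hypothetical, 
and manifests are modeled as plain lists of data file paths.

```python
from collections import defaultdict


def find_duplicate_entries(manifests):
    """Return data file paths that are referenced more than once.

    `manifests` is a hypothetical in-memory model: a dict mapping a
    manifest path to the list of data file paths it contains.
    """
    seen = defaultdict(list)
    # Every entry of every live manifest has to be visited, which is
    # why running this on each append would be expensive.
    for manifest_path, file_paths in manifests.items():
        for file_path in file_paths:
            seen[file_path].append(manifest_path)
    # Keep only files referenced by more than one entry.
    return {path: refs for path, refs in seen.items() if len(refs) > 1}
```

   The work grows with the total number of entries in the table's current 
manifests, not with the size of the append.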
   
   Overlapping files across manifests do imply a bug in whatever integration 
is writing to the table. However, even after those bugs are fixed, it still 
makes sense for Iceberg to expose repair procedures to correct those tables 
and unblock users from using them. 
   More generally, imo it makes sense to offer a general `RepairTable` 
procedure with different options that let users correct their tables as best 
as possible in case a buggy implementation ended up writing to them.
   
   >Also, I understand that there was an old bug where data file size was 
written incorrectly and this actually caused reads to fail, and this is the 
motivation for correcting the statistics in metadata. However, that bug was 
long fixed, so I wonder if there are still known situations where these 
statistics need to be corrected.
   
   
   Yeah, I think the same reasoning as above applies: some implementation 
may end up writing incorrect statistics for whatever reason, and it'd be good 
for repair table to correct that, since stats are something that can be 
deterministically corrected.
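
   As an illustration of "deterministically corrected", here is a minimal 
sketch for the file size case. It is not the proposed procedure: 
`repair_file_sizes` and the `size_of` callback are hypothetical, and entries 
are modeled as plain dicts using the `file_path` and `file_size_in_bytes` 
field names.

```python
import os


def repair_file_sizes(entries, size_of=os.path.getsize):
    """Return entries with `file_size_in_bytes` matching the real file size.

    `entries` is a hypothetical list of dicts with `file_path` and
    `file_size_in_bytes` keys. `size_of` is any callable that returns the
    actual size for a path; it defaults to the local filesystem.
    """
    repaired = []
    for entry in entries:
        actual = size_of(entry["file_path"])
        if entry["file_size_in_bytes"] != actual:
            # Copy rather than mutate so the input stays untouched.
            entry = {**entry, "file_size_in_bytes": actual}
        repaired.append(entry)
    return repaired
```

   The corrected value depends only on the file itself, so rerunning the 
repair gives the same result.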


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

