krisnaru opened a new issue, #14458:
URL: https://github.com/apache/iceberg/issues/14458

   ### Apache Iceberg version
   
   1.10.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   The `snapshotId` filtering logic incorrectly excludes live data files 
during table copy operations. `entry.snapshotId()` records the snapshot in 
which a data file was originally added, not the snapshots that currently 
reference it. After manifest compaction or snapshot expiration, a snapshot can 
reference manifests whose entries carry expired snapshot IDs, yet those files 
are still live and must be copied.
   
   The check `snapshotIds.contains(entry.snapshotId())` is fundamentally wrong 
because it filters out data files whose original snapshot has expired, even 
though they are still referenced by the snapshot(s) being copied.
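
   A minimal sketch of the failure mode, using a hypothetical in-memory model 
rather than the Iceberg classes: `Entry`, `filterByAddingSnapshot`, 
`filterByLiveness`, and `compactedManifest` are illustrative names, and the 
liveness rule (keep every non-deleted entry of a manifest reachable from the 
copied snapshot) is an assumption about what a fix could look like, not the 
actual patch.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class LiveFileFilterSketch {

  // Hypothetical stand-in for a manifest entry; this is not the Iceberg class.
  static final class Entry {
    enum Status { EXISTING, ADDED, DELETED }

    final String path;
    final long snapshotId; // snapshot in which the file was originally added
    final Status status;

    Entry(String path, long snapshotId, Status status) {
      this.path = path;
      this.snapshotId = snapshotId;
      this.status = status;
    }
  }

  // Mirrors the reported check: keep a file only if the snapshot that
  // originally added it is among the snapshots being copied.
  static List<String> filterByAddingSnapshot(List<Entry> entries, Set<Long> snapshotIds) {
    return entries.stream()
        .filter(e -> e.status != Entry.Status.DELETED)
        .filter(e -> snapshotIds.contains(e.snapshotId))
        .map(e -> e.path)
        .collect(Collectors.toList());
  }

  // Assumed alternative: every non-deleted entry in a manifest reachable
  // from the copied snapshot is live, whichever snapshot added it.
  static List<String> filterByLiveness(List<Entry> entries) {
    return entries.stream()
        .filter(e -> e.status != Entry.Status.DELETED)
        .map(e -> e.path)
        .collect(Collectors.toList());
  }

  // Entries of a compacted manifest referenced by snapshot 3.
  // Snapshot 1, which added a.parquet, has already expired.
  static List<Entry> compactedManifest() {
    return List.of(
        new Entry("a.parquet", 1L, Entry.Status.EXISTING),
        new Entry("b.parquet", 3L, Entry.Status.ADDED));
  }

  public static void main(String[] args) {
    Set<Long> copiedSnapshots = Set.of(3L);
    // Prints [b.parquet]: a.parquet is dropped although it is still live.
    System.out.println(filterByAddingSnapshot(compactedManifest(), copiedSnapshots));
    // Prints [a.parquet, b.parquet]
    System.out.println(filterByLiveness(compactedManifest()));
  }
}
```

   In this model the only difference between the two filters is the 
`snapshotIds.contains(...)` test, which is enough to lose `a.parquet` from the 
copy.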
   
   This bug likely affects many production tables where manifest compaction has 
run. Customers may not notice the problem unless they query the missing data 
files.
   
   ### Willingness to contribute
   
   - [x] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

