yadavay-amzn commented on issue #15487:
URL: https://github.com/apache/iceberg/issues/15487#issuecomment-4446080901

   @pvary I investigated the root cause as you suggested in #15568. The tasks 
are not running concurrently -- the lock works correctly and only one task 
executes at a time.
   
   The actual issue is stale table state in `ListMetadataFiles`. This operator 
loads the table once at job start (in `open()`) and never calls 
`table.refresh()` in `processElement()`. As a result, it only emits manifest 
list paths for snapshots that existed when the Flink job started, and the 
metadata files of any snapshot added after job start are missing from the 
"referenced" set.
   
   When `DeleteOrphanFiles` runs, it correctly refreshes via 
`MetadataTablePlanner` (which does call `table.refresh()`), but the manifest 
list protection comes exclusively from `ListMetadataFiles`. Since that operator 
sees a stale snapshot list, manifest lists of newer snapshots are classified as 
orphans and deleted. On the next cycle, `ExpireSnapshots` tries to read those 
manifest lists in `IncrementalFileCleanup` and fails with `NotFoundException`.
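
   To make the failure mode concrete, here is a minimal sketch of the 
classification step as a plain set difference. It is not the Iceberg 
implementation: the class name, the `orphans` helper, and the file names are 
all hypothetical and only illustrate how a stale "referenced" set turns a live 
manifest list into an orphan.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from Iceberg.
class OrphanClassificationSketch {

  // Anything found in storage that nobody reported as referenced is an orphan.
  static Set<String> orphans(Set<String> filesInStorage, Set<String> referenced) {
    Set<String> result = new TreeSet<>(filesInStorage);
    result.removeAll(referenced);
    return result;
  }

  public static void main(String[] args) {
    // Three manifest lists exist in storage (made-up names).
    Set<String> storage = Set.of("snap-1.avro", "snap-2.avro", "snap-3.avro");

    // The stale operator only knows the snapshots that existed at job start.
    Set<String> staleReferenced = Set.of("snap-1.avro", "snap-2.avro");

    // snap-3.avro belongs to a live snapshot but is still classified as orphan.
    System.out.println(orphans(storage, staleReferenced));
  }
}
```

   In this sketch the manifest list of the newest snapshot is the only file 
reported for deletion, which matches the symptom described above.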
   
   The fix is adding `table.refresh()` to `ListMetadataFiles.processElement()`, 
matching what `MetadataTablePlanner` already does. I have opened #16324 with 
the fix and a regression test that verifies snapshots added after operator open 
are included in the output.
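
   The effect of the refresh can be sketched with plain Java stand-ins. The 
`Catalog`, `TableHandle`, and `ListMetadataFilesSketch` classes below are 
hypothetical and are not the code in #16324; they only assume that a table 
handle caches its snapshot list until `refresh()` is called, and that the 
operator loads the handle once in `open()`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are not Iceberg or Flink classes.
class StaleTableSketch {

  // Holds the latest committed state, like a catalog would.
  static class Catalog {
    private final List<String> manifestLists = new ArrayList<>();

    void commitSnapshot(String manifestList) {
      manifestLists.add(manifestList);
    }

    List<String> currentManifestLists() {
      return new ArrayList<>(manifestLists);
    }
  }

  // Caches the state seen at load time until refresh() is called.
  static class TableHandle {
    private final Catalog catalog;
    private List<String> cached;

    TableHandle(Catalog catalog) {
      this.catalog = catalog;
      this.cached = catalog.currentManifestLists();
    }

    void refresh() {
      cached = catalog.currentManifestLists();
    }

    List<String> manifestLists() {
      return cached;
    }
  }

  // Loads the table once in open() and emits manifest list paths per trigger.
  static class ListMetadataFilesSketch {
    private final boolean refreshOnEachElement;
    private TableHandle table;

    ListMetadataFilesSketch(boolean refreshOnEachElement) {
      this.refreshOnEachElement = refreshOnEachElement;
    }

    void open(Catalog catalog) {
      table = new TableHandle(catalog);
    }

    List<String> processElement() {
      if (refreshOnEachElement) {
        table.refresh(); // the one-line change this sketch is about
      }
      return new ArrayList<>(table.manifestLists());
    }
  }

  public static void main(String[] args) {
    Catalog catalog = new Catalog();
    catalog.commitSnapshot("snap-1.avro");

    ListMetadataFilesSketch stale = new ListMetadataFilesSketch(false);
    ListMetadataFilesSketch fixed = new ListMetadataFilesSketch(true);
    stale.open(catalog);
    fixed.open(catalog);

    // A snapshot is committed after both operators were opened.
    catalog.commitSnapshot("snap-2.avro");

    System.out.println("without refresh: " + stale.processElement());
    System.out.println("with refresh:    " + fixed.processElement());
  }
}
```

   A regression test along the same lines would open the operator, commit a 
new snapshot, trigger `processElement()`, and assert that the new manifest 
list appears in the output.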


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

