yadavay-amzn commented on issue #15487: URL: https://github.com/apache/iceberg/issues/15487#issuecomment-4446080901
@pvary I investigated the root cause as you suggested in #15568. The tasks are not running concurrently: the lock works correctly and only one task executes at a time.

The actual issue is stale table state in `ListMetadataFiles`. This operator loads the table once at job start (`open()`) and never calls `table.refresh()` in `processElement()`. It only emits manifest list paths for snapshots that existed when the Flink job started. Any snapshot added after job start has its metadata files missing from the "referenced" set.

When `DeleteOrphanFiles` runs, it correctly refreshes via `MetadataTablePlanner` (which does call `table.refresh()`), but the manifest list protection comes exclusively from `ListMetadataFiles`. Since that operator sees a stale snapshot list, manifest lists of newer snapshots are classified as orphans and deleted. On the next cycle, `ExpireSnapshots` tries to read those manifest lists in `IncrementalFileCleanup` and fails with `NotFoundException`.

The fix is adding `table.refresh()` to `ListMetadataFiles.processElement()`, matching what `MetadataTablePlanner` already does. I have opened #16324 with the fix and a regression test that verifies snapshots added after operator open are included in the output.
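To illustrate the failure mode described above, here is a minimal self-contained sketch. It does not use the Iceberg or Flink APIs: `Catalog`, `TableHandle`, `Lister`, and `orphans` are hypothetical stand-ins invented for this example, and the file names are made up. It only models the idea that a handle loaded once at open time keeps a cached snapshot list until it is refreshed.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleTableSketch {
    // Hypothetical stand-in for the committed table state (not Iceberg's catalog).
    static class Catalog {
        private final List<String> manifestLists = new ArrayList<>();
        void commit(String manifestList) { manifestLists.add(manifestList); }
        List<String> current() { return new ArrayList<>(manifestLists); }
    }

    // Hypothetical table handle: caches the snapshot state until refresh() is called.
    static class TableHandle {
        private final Catalog catalog;
        private List<String> cached;
        TableHandle(Catalog catalog) {
            this.catalog = catalog;
            this.cached = catalog.current();
        }
        void refresh() { cached = catalog.current(); }
        List<String> manifestLists() { return cached; }
    }

    // Hypothetical operator: the constructor plays the role of open().
    static class Lister {
        private final TableHandle table;
        private final boolean refreshPerElement;
        Lister(Catalog catalog, boolean refreshPerElement) {
            this.table = new TableHandle(catalog);
            this.refreshPerElement = refreshPerElement;
        }
        // Emits the manifest lists this operator considers "referenced".
        List<String> processElement() {
            if (refreshPerElement) {
                table.refresh();
            }
            return new ArrayList<>(table.manifestLists());
        }
    }

    // Anything on disk that is not in the referenced set is treated as an orphan.
    static List<String> orphans(List<String> filesOnDisk, List<String> referenced) {
        List<String> result = new ArrayList<>(filesOnDisk);
        result.removeAll(referenced);
        return result;
    }

    public static void main(String[] args) {
        Catalog catalog = new Catalog();
        catalog.commit("snap-1.avro");

        Lister stale = new Lister(catalog, false);      // never refreshes after open
        Lister refreshing = new Lister(catalog, true);  // refreshes on every element

        catalog.commit("snap-2.avro");                  // snapshot added after job start
        List<String> onDisk = catalog.current();

        System.out.println("stale orphans: " + orphans(onDisk, stale.processElement()));
        System.out.println("refreshing orphans: " + orphans(onDisk, refreshing.processElement()));
    }
}
```

In this toy model the non-refreshing lister reports the manifest list of the newer snapshot as an orphan, while the refreshing one does not, which mirrors the behavior the fix is meant to restore.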
