chenwyi2 opened a new issue, #8806: URL: https://github.com/apache/iceberg/issues/8806
### Apache Iceberg version

1.2.1

### Query engine

Flink

### Please describe the bug 🐞

Recently a job failed with "Failed to open input stream for file: xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro". The situation: a task failed with checkpoint id 25893, and when the job restarted, Flink reset the checkpoint ID to 25893 and restored the job from Savepoint 25892. However, temporary manifests are deleted once a commit succeeds, so the manifests for checkpoint id 25892 had already been deleted before the restore. How can we deal with this?

Detailed log:

`2023-10-09 16:39:57,724 INFO org.apache.iceberg.hive.HiveTableOperations [] - Committed to table icebergCatalog.xxx with the new metadata location xxx/metadata/300237-907ef004-3085-439f-b606-fc2b106bcb54.metadata.json
2023-10-09 16:39:57,747 INFO org.apache.hadoop.fs.TrashPolicyDefault [] - Moved: 'xxx/metadata/300136-01612e4a-add1-4a3f-b7e7-1ee25e063e04.metadata.json' to trash
2023-10-09 16:39:57,747 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Successfully committed to table icebergCatalog.xxx in 3142 ms
2023-10-09 16:39:57,747 INFO org.apache.iceberg.SnapshotProducer [] - Committed snapshot 8753072822283034565 (MergeAppend)
2023-10-09 16:39:57,788 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committed append to table: icebergCatalog.xxx, branch: main, checkpointId 25892 in 7394 ms
2023-10-09 16:39:58,011 INFO org.apache.hadoop.fs.TrashPolicyDefault [] - Moved: 'xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro' to trash
2023-10-09 16:39:58,011 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - deleted manifest : xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro`

Then the job failed for other reasons:

`2023-10-09 16:41:59,902 INFO org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Checkpoint 25893 has been notified as aborted, would not trigger any checkpoint.`

After the restart:

`2023-10-09 16:43:41,742 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Reset the checkpoint ID of job 3e878a638ceb45633f31e8813c521740 to 25893.
2023-10-09 16:43:41,742 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 3e878a638ceb45633f31e8813c521740 from Savepoint 25892 @ 0 for 3e878a638ceb45633f31e8813c521740 located at xxx`

But the manifest had already been deleted.
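
To make the failure mode concrete, here is a minimal sketch of one way a committer could avoid re-reading manifests that were already committed and deleted: on restore, drop every restored checkpoint whose id is not greater than the last committed checkpoint id. This is an illustration under assumptions, not the actual `IcebergFilesCommitter` code. The class `RestoreFilter`, the method `uncommitted`, and the idea that the last committed checkpoint id (25892 here) can be read back from the table are all hypothetical, as is the second manifest path.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper, not part of Iceberg or Flink.
class RestoreFilter {

    // Keep only restored entries whose checkpoint id is strictly greater than
    // the last checkpoint already committed to the table. Manifests for
    // committed checkpoints may already be deleted and must not be reopened.
    static NavigableMap<Long, String> uncommitted(
            NavigableMap<Long, String> restoredManifests, long maxCommittedCheckpointId) {
        return restoredManifests.tailMap(maxCommittedCheckpointId, false);
    }

    public static void main(String[] args) {
        // State as it might look when restored from checkpoint 25892:
        // checkpoint id -> temporary manifest path.
        NavigableMap<Long, String> restored = new TreeMap<>();
        restored.put(25892L,
            "xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro");
        // Hypothetical later entry, only here to show what survives the filter.
        restored.put(25893L, "xxx/metadata/hypothetical-manifest-25893.avro");

        // Assumption: the table reports 25892 as the last committed checkpoint.
        long maxCommitted = 25892L;

        for (Long checkpointId : uncommitted(restored, maxCommitted).keySet()) {
            System.out.println("would commit checkpoint " + checkpointId);
        }
    }
}
```

If the restored state is filtered this way, the already deleted manifest for 25892 is never opened. Whether 1.2.1 applies an equivalent filter in this code path, and why it did not help here, is exactly what needs confirming.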