chenwyi2 opened a new issue, #8806:
URL: https://github.com/apache/iceberg/issues/8806

   ### Apache Iceberg version
   
   1.2.1
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 🐞
   
   Recently a job failed with "Failed to open input stream for file: 
xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro".
 The situation: a task failed during checkpoint 25893. When the job was 
restarted, Flink reset the checkpoint ID to 25893 and restored the job from 
Savepoint 25892. However, temporary manifests are deleted once a commit 
succeeds, so the manifest for checkpoint 25892 had already been deleted by the 
time the job was restored. How can we deal with this? 
   The detailed log is:
   `2023-10-09 16:39:57,724 INFO  org.apache.iceberg.hive.HiveTableOperations [] - Committed to table icebergCatalog.xxx with the new metadata location xxx/metadata/300237-907ef004-3085-439f-b606-fc2b106bcb54.metadata.json
   2023-10-09 16:39:57,747 INFO  org.apache.hadoop.fs.TrashPolicyDefault [] - Moved: 'xxx/metadata/300136-01612e4a-add1-4a3f-b7e7-1ee25e063e04.metadata.json' to trash
   2023-10-09 16:39:57,747 INFO  org.apache.iceberg.BaseMetastoreTableOperations [] - Successfully committed to table icebergCatalog.xxx in 3142 ms
   2023-10-09 16:39:57,747 INFO  org.apache.iceberg.SnapshotProducer [] - Committed snapshot 8753072822283034565 (MergeAppend)
   2023-10-09 16:39:57,788 INFO  org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committed append to table: icebergCatalog.xxx, branch: main, checkpointId 25892 in 7394 ms
   2023-10-09 16:39:58,011 INFO  org.apache.hadoop.fs.TrashPolicyDefault [] - Moved: 'xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro' to trash
   2023-10-09 16:39:58,011 INFO  org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - deleted manifest : xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro
   `
   The job then failed for other reasons:
   `2023-10-09 16:41:59,902 INFO  org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Checkpoint 25893 has been notified as aborted, would not trigger any checkpoint.`
   
   After the restart:
   
   `2023-10-09 16:43:41,742 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Reset the checkpoint ID of job 3e878a638ceb45633f31e8813c521740 to 25893.
   2023-10-09 16:43:41,742 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 3e878a638ceb45633f31e8813c521740 from Savepoint 25892 @ 0 for 3e878a638ceb45633f31e8813c521740 located at xxx`
   
   But the manifest for checkpoint 25892 had already been deleted.
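
To make the sequence concrete, here is a minimal sketch of one way a restore step could avoid reading a manifest that was already committed and deleted: compare each restored checkpoint ID with the highest checkpoint ID already committed to the table, and skip entries at or below it. This is an illustration under assumptions, not Iceberg's actual code: the function name, the dict-based state, and the `max_committed_checkpoint_id` parameter are hypothetical, and Python is used only for brevity.

```python
def pending_after_restore(restored_state, max_committed_checkpoint_id):
    """Return the restored entries that still need to be committed.

    restored_state: hypothetical mapping of checkpoint ID -> temporary
        manifest path, as recovered from the checkpoint/savepoint.
    max_committed_checkpoint_id: hypothetical value for the highest
        checkpoint ID whose files are already committed to the table.
    """
    return {
        checkpoint_id: manifest_path
        for checkpoint_id, manifest_path in sorted(restored_state.items())
        # Entries at or below the committed ID were committed before the
        # failure, so their temporary manifests may no longer exist.
        if checkpoint_id > max_committed_checkpoint_id
    }


# State restored from Savepoint 25892 still points at the deleted manifest.
restored = {
    25892: "xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-"
           "49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro",
}

# Checkpoint 25892 was already committed, so nothing should be re-read.
to_commit = pending_after_restore(restored, max_committed_checkpoint_id=25892)
print(to_commit)
```

With the values from the log above, the restored entry for checkpoint 25892 is filtered out, so the deleted manifest would never be opened. Whether the committer in 1.2.1 reaches such a check before opening the file is exactly the open question here.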

