SandeepSinghGahir commented on issue #10340: URL: https://github.com/apache/iceberg/issues/10340#issuecomment-2550591724
Hi @amogh-jahagirdar, this issue isn't resolved yet. With the Glue 5.0 release, I tested a job on Iceberg 1.7.0 and I'm still seeing the same error, just with different logging. Here is the stack trace; any help in resolving this issue is greatly appreciated.

```
ERROR 2024-12-07T02:04:22,219 814843 com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener [spark-listener-group-shared] [Glue Exception Analysis]
{"Event":"GlueExceptionAnalysisStageFailed","Timestamp":1733537062218,
 "Failure Reason":"org.apache.spark.shuffle.FetchFailedException: Error in reading FileSegmentManagedBuffer[file=/tmp/blockmgr-e22f16fc-d99e-4692-aa4b-66a91/0c/shuffle_11_118332_0.data,offset=288812863,length=188651]",
 "Stack Trace":[
   {"Declaring Class":"org.apache.spark.errors.SparkCoreErrors$","Method Name":"fetchFailedError","File Name":"SparkCoreErrors.scala","Line Number":437},
   {"Declaring Class":"org.apache.spark.storage.ShuffleBlockFetcherIterator","Method Name":"throwFetchFailedException","File Name":"ShuffleBlockFetcherIterator.scala","Line Number":1304},
   {"Declaring Class":"org.apache.spark.storage.ShuffleBlockFetcherIterator","Method Name":"next","File Name":"ShuffleBlockFetcherIterator.scala","Line Number":957},
   {"Declaring Class":"org.apache.spark.storage.Shuffl... (truncated in the original log)

ERROR 2024-12-07T02:04:25,531 818155 org.apache.spark.scheduler.TaskSchedulerImpl [dispatcher-CoarseGrainedScheduler]
Lost executor 95 on 172.34.30.9: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.spark.ExecutorDeadException: [INTERNAL_ERROR_NETWORK] The relative remote executor(Id: 95), which maintains the block data to fetch is dead.
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:145)
    at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:173)
    at org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:206)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.spark.ExecutorDeadException: [INTERNAL_ERROR_NETWORK] The relative remote executor(Id: 95), which maintains the block data to fetch is dead.
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:145)
    at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:173)
    at org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:152)
    at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:155)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:403)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.send$1... (truncated in the original log)
[the same "Caused by: org.apache.spark.ExecutorDeadException" block is repeated several more times throughout the log; further repeats omitted]

ERROR 2024-12-07T02:04:28,291 820915 com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener [spark-listener-group-shared] [Glue Exception Analysis]
{"Event":"GlueExceptionAnalysisTaskFailed","Timestamp":1733537068290,
 "Failure Reason":"Connection pool shut down",
 "Stack Trace":[
   {"Declaring Class":"software.amazon.awssdk.thirdparty.org.apache.http.util.Asserts","Method Name":"check","File Name":"Asserts.java","Line Number":34},
   {"Declaring Class":"software.amazon.awssdk.thirdparty.org.apache.http.impl.conn.PoolingHttpClientConnectionManager","Method Name":"requestConnection","File Name":"PoolingHttpClientConnectionManager.java","Line Number":269},
   {"Declaring Class":"software.amazon.awssdk.http.apache.internal.conn.ClientConnectionManagerFactory$DelegatingHttpClientConnectionManager","Method Name":"requestConnection","File Name":"ClientConnectionManagerFactory.java","Line Number":75},
   {"Declaring Class":"software.amazon.awssdk.http.apache.internal.conn.ClientConnectionManagerFactory$Instrumented... (truncated in the original log)

ERROR 2024-12-07T02:04:28,401 821025 org.apache.spark.scheduler.TaskSetManager [task-result-getter-3]
Task 47 in stage 51.3 failed 4 times; aborting job

ERROR 2024-12-07T02:04:28,408 821032 com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener [spark-listener-group-shared] [Glue Exception Analysis]
{"Event":"GlueExceptionAnalysisJobFailed","Timestamp":1733537068406,
 "Failure Reason":"JobFailed(org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 51.3 failed 4 times, most recent failure: Lost task 47.3 in stage 51.3 (TID 172048) (172.36.175.193 executor 46): java.lang.IllegalStateException: Connection pool shut down",
 "Stack Trace":[
   {"Declaring Class":"org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 51.3 failed 4 times, most recent failure: Lost task 47.3 in stage 51.3 (TID 172048) (172.36.175.193 executor 46): java.lang.IllegalStateException: Connection pool shut down","Method Name":"TopLevelFailedReason","File Name":"TopLevelFailedReason","Line Number":-1},
   {"Declaring Class":"software.amazon.awssdk.thirdparty.org.apache.http.util.Asserts","Method Name":"check","File Name":"... (truncated in the original log)
```
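For context, the job uses the usual Iceberg-on-Glue wiring (Glue Data Catalog plus S3FileIO). A minimal sketch of that setup is below; the catalog name, warehouse location, and the sample MERGE are placeholders rather than the exact job configuration:

```python
# Illustrative sketch of the Glue 5.0 / Iceberg 1.7.0 setup that hits this error.
# Catalog name, warehouse location, and table names are placeholders, not the exact job config.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-glue-job")
    # Standard Iceberg-on-Glue catalog wiring: Glue Data Catalog + S3FileIO.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<bucket>/<warehouse-prefix>")
    .getOrCreate()
)

# The failure surfaces in shuffle-heavy stages of a write against the Iceberg table,
# e.g. a MERGE like this (hypothetical table names):
spark.sql("""
    MERGE INTO glue_catalog.db.target t
    USING glue_catalog.db.updates s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

As the stack-trace entries show, the "Connection pool shut down" IllegalStateException is raised by the AWS SDK's Apache HTTP connection manager, i.e. the HTTP client underneath S3FileIO.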