lixy529 opened a new issue, #352: URL: https://github.com/apache/incubator-uniffle/issues/352
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

Many tasks of Spark jobs throw an exception reporting an inconsistent number of blocks. The stack trace is as follows:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 9.0 failed 6 times, most recent failure: Lost task 2.5 in stage 9.0 (TID 7653, BJLFRZ-10k-152-228.hadoop.jd.local, executor 159): org.apache.uniffle.common.exception.RssException: Blocks read inconsistent: expected 7 blocks, actual 0 blocks
        at org.apache.uniffle.client.impl.ShuffleReadClientImpl.checkProcessedBlockIds(ShuffleReadClientImpl.java:215)
        at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:135)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.shuffle.reader.RssShuffleReader$MultiPartitionIterator.hasNext(RssShuffleReader.java:227)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:768)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:134)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:129)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:467)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1478)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:470)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2083)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2032)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2031)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2031)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:979)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:979)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:979)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2263)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2212)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2201)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    Caused by: org.apache.uniffle.common.exception.RssException: Blocks read inconsistent: expected 7 blocks, actual 0 blocks
        at org.apache.uniffle.client.impl.ShuffleReadClientImpl.checkProcessedBlockIds(ShuffleReadClientImpl.java:215)
        at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:135)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.shuffle.reader.RssShuffleReader$MultiPartitionIterator.hasNext(RssShuffleReader.java:227)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:768)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:134)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)

### Affects Version(s)

0.6.0

### Uniffle Server Log Output

_No response_

### Uniffle Engine Log Output

_No response_

### Uniffle Server Configurations

```yaml
rss.coordinator.quorum=xxx:19999,xxx:19999,xxx:19999
rss.jetty.http.port=19998
rss.prometheus.push.enabled=true
rss.prometheus.uniffle.cluster.name=test100
rss.rpc.executor.size=2000
rss.rpc.message.max.size=1073741824
rss.rpc.server.port=19999
rss.server.app.expired.withoutHeartbeat=120000
rss.server.buffer.capacity=30g
rss.server.commit.timeout=600000
rss.server.flush.cold.storage.threshold.size=64m
rss.server.flush.thread.alive=5
rss.server.flush.threadPool.size=10
rss.server.heartbeat.interval=10000
rss.server.heartbeat.timeout=60000
rss.server.localstorage.initialize.max.fail.number=6
rss.server.preAllocation.expired=120000
rss.server.read.buffer.capacity=15g
rss.storage.type=MEMORY_HDFS
```

### Uniffle Engine Configurations

_No response_

### Additional context

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
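
For readers triaging this report, the kind of check that produces the "Blocks read inconsistent" message can be pictured roughly as follows. This is a minimal sketch, assuming the reader compares the set of block IDs it expected for a partition against the set it actually processed. The class `BlockConsistencySketch` and its use of `java.util.Set` and `IllegalStateException` are hypothetical stand-ins for illustration; this is not the Uniffle source of `ShuffleReadClientImpl.checkProcessedBlockIds`.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the Uniffle source code.
class BlockConsistencySketch {

    // Throws if any expected block ID was not processed by the reader.
    static void checkProcessedBlockIds(Set<Long> expected, Set<Long> processed) {
        // Keep only the expected IDs that were actually processed.
        Set<Long> matched = new HashSet<>(expected);
        matched.retainAll(processed);
        if (!matched.equals(expected)) {
            throw new IllegalStateException(
                "Blocks read inconsistent: expected " + expected.size()
                    + " blocks, actual " + matched.size() + " blocks");
        }
    }

    public static void main(String[] args) {
        // Assumed scenario mirroring the report: 7 blocks expected.
        Set<Long> expected = new HashSet<>();
        for (long id = 1; id <= 7; id++) {
            expected.add(id);
        }
        // Assumed scenario: the reader got no blocks back at all.
        Set<Long> processed = new HashSet<>();
        try {
            checkProcessedBlockIds(expected, processed);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this sketch, an actual count of 0 would mean that none of the expected blocks were read back. The trace alone does not show whether the blocks were never flushed, were removed, or could not be located by the reader.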
