[GitHub] [iceberg] ldwnt opened a new issue, #6956: data file rewriting spark job fails with oom

via GitHub Mon, 27 Feb 2023 23:07:23 -0800


ldwnt opened a new issue, #6956:
URL: https://github.com/apache/iceberg/issues/6956


   ### Apache Iceberg version
   
   0.13.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   The iceberg table being rewritten has ~ 9 million rows and 2GB. The 
rewriting spark job runs with parameters: target-file-size-bytes=128M, 
spark.driver.memory=20g, spark.executor.memory=3g, num-executors 10, 
deploy-mode cluster, master=yarn.
   Some executors report failed tasks with the logs as below:
   ```
   23/02/28 05:02:33 WARN [executor-heartbeater] Executor: Issue communicating 
with driver in heartbeater
   org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 
milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
        at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
        at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
        at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
        at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
        at 
org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
        at 
org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
        at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
        at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
[10000 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
        at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        ... 13 more
   ```
   Finally some executors end with oom and the spark job failed. The heap dump 
suggests that a huge StructLikeSet is being created from DeleteFilter
   ```
         StructLikeSet deleteSet = Deletes.toEqualitySet(
             CloseableIterable.transform(
                 records, record -> new 
InternalRecordWrapper(deleteSchema.asStruct()).wrap(record)),
             deleteSchema.asStruct());
   ```
   
![image](https://user-images.githubusercontent.com/7655486/221776784-e835a2f2-3353-4639-a53a-709161b20f62.png)
   This table is updated every few days and each time the whole table is 
deleted and inserted again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] ldwnt opened a new issue, #6956: data file rewriting spark job fails with oom

Reply via email to