Dear Hadoop community,
I am testing my YARN cluster with Zeppelin notebooks, using the Spark engine to
submit Python code. The whole point of this exercise is to analyse how YARN
behaves in different situations.
For instance, I am trying to run code that needs a relatively large amount of
memory compared to the memory allocated to the Spark driver.
The job fails, but I can't easily see why this is happening, so I am wondering
whether I have other problems (e.g. an error in the cluster configuration).
This is the submitted code (the only thing I am trying to do is allocate a
large amount of memory to see how YARN behaves):
a = "bigword"
aList = []
for i in range(1000):
    aList.append(i**i * a)
# print aList
for word in aList:
    print word
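As a rough sanity check (my own back-of-the-envelope estimate, not part of the submitted job): each iteration builds a string of i**i * len("bigword") characters, so the allocation blows up astronomically long before i reaches 1000:

```python
# Back-of-the-envelope estimate of the string size the test loop builds
# at a few values of i: i**i * "bigword" is a string of i**i * 7 characters.
word = "bigword"
for i in (5, 10, 20, 30):
    chars = i ** i * len(word)
    print("i=%d: ~%.3e characters" % (i, float(chars)))
```

Already at i=30 this is on the order of 10^45 characters, far beyond any cluster's memory, so the loop is guaranteed to exhaust the driver early on.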
The application fails and this is the error I can see in the logs:
Exception 1:
WARN [2019-10-17 09:28:58,567] ({Reporter} Logging.scala[logWarning]:66) -
Container marked as failed: container_e15_1570749574365_0027_01_000003 on host:
r940-1-1-mlx.mlx. Exit status: 1. Diagnostics: [2019-10-17
09:28:57.458]Exception from container-launch.
Container id: container_e15_1570749574365_0027_01_000003
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is mansop
main : requested yarn user is mansop
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file
/d1/hadoop/yarn/local/nmPrivate/application_1570749574365_0027/container_e15_1570749574365_0027_01_000003/container_e15_1570749574365_0027_01_000003.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
[2019-10-17 09:28:57.459]Container exited with a non-zero exit code 1. Error
file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class
org.apache.spark.executor.CoarseGrainedExecutorBackend
[2019-10-17 09:28:57.462]Container exited with a non-zero exit code 1. Error
file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class
org.apache.spark.executor.CoarseGrainedExecutorBackend
WARN [2019-10-17 09:28:58,571] ({dispatcher-event-loop-19}
Logging.scala[logWarning]:66) - Requesting driver to remove executor 2 for
reason Container marked as failed: container_e15_1570749574365_0027_01_000003
on host: r940-1-1-mlx.mlx. Exit status: 1. Diagnostics: [2019-10-17
09:28:57.458]Exception from container-launch.
Container id: container_e15_1570749574365_0027_01_000003
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is mansop
main : requested yarn user is mansop
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file
/d1/hadoop/yarn/local/nmPrivate/application_1570749574365_0027/container_e15_1570749574365_0027_01_000003/container_e15_1570749574365_0027_01_000003.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
[2019-10-17 09:28:57.459]Container exited with a non-zero exit code 1. Error
file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class
org.apache.spark.executor.CoarseGrainedExecutorBackend
[2019-10-17 09:28:57.462]Container exited with a non-zero exit code 1. Error
file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class
org.apache.spark.executor.CoarseGrainedExecutorBackend
Exception 2:
DEBUG [2019-10-17 09:29:20,583] ({pool-5-thread-1}
RemoteInterpreterServer.java[interpret]:426) - st:
a = "bigword"
aList = []
for i in range(1000):
aList.append(i**i*a)
#print aList
for word in aList:
print word
DEBUG [2019-10-17 09:29:20,584] ({pool-5-thread-1}
RemoteInterpreterEventClient.java[sendEvent]:413) - Send Event:
RemoteInterpreterEvent(type:OUTPUT_UPDATE_ALL,
data:{"messages":[],"noteId":"2ENM9X82N","paragraphId":"20190926-163159_1153559848"})
DEBUG [2019-10-17 09:29:20,584] ({pool-5-thread-2}
RemoteInterpreterEventClient.java[pollEvent]:366) - Send event OUTPUT_UPDATE_ALL
INFO [2019-10-17 09:29:20,584] ({pool-6-thread-4}
SchedulerFactory.java[jobStarted]:109) - Job 20190926-163159_1153559848 started
by scheduler interpreter_1768035734
DEBUG [2019-10-17 09:29:20,585] ({pool-6-thread-4}
RemoteInterpreterServer.java[jobRun]:632) - Script after hooks: a = "bigword"
aList = []
for i in range(1000):
aList.append(i**i*a)
#print aList
for word in aList:
print word
__zeppelin__._displayhook()
DEBUG [2019-10-17 09:29:20,585] ({pool-6-thread-4}
RemoteInterpreterEventClient.java[sendEvent]:413) - Send Event:
RemoteInterpreterEvent(type:META_INFOS, data:{"message":"Spark UI
enabled","url":"http://r640-1-10-mlx.mlx:42304"})
DEBUG [2019-10-17 09:29:20,586] ({pool-5-thread-2}
RemoteInterpreterEventClient.java[pollEvent]:366) - Send event META_INFOS
DEBUG [2019-10-17 09:29:20,684] ({pool-5-thread-3}
Interpreter.java[getProperty]:222) - key: zeppelin.spark.concurrentSQL, value:
false
DEBUG [2019-10-17 09:29:20,684] ({pool-5-thread-3}
Interpreter.java[getProperty]:222) - key: zeppelin.spark.concurrentSQL, value:
false
ERROR [2019-10-17 09:29:25,301] ({SIGTERM handler}
SignalUtils.scala[apply$mcZ$sp]:43) - RECEIVED SIGNAL TERM
INFO [2019-10-17 09:29:25,308] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Invoking stop() from shutdown hook
INFO [2019-10-17 09:29:25,327] ({shutdown-hook-0}
AbstractConnector.java[doStop]:318) - Stopped
Spark@21f71c23{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
INFO [2019-10-17 09:29:25,331] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Stopped Spark web UI at http://r640-1-10-mlx.mlx:42304
INFO [2019-10-17 09:29:25,387] ({dispatcher-event-loop-44}
Logging.scala[logInfo]:54) - Driver requested a total number of 0 executor(s).
INFO [2019-10-17 09:29:25,389] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Shutting down all executors
INFO [2019-10-17 09:29:25,392] ({dispatcher-event-loop-51}
Logging.scala[logInfo]:54) - Asking each executor to shut down
INFO [2019-10-17 09:29:25,398] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
ERROR [2019-10-17 09:29:25,407] ({Reporter} Logging.scala[logError]:91) -
Exception from Reporter thread.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException:
Application attempt appattempt_1570749574365_0027_000001 doesn't exist in
ApplicationMasterService cache.
at
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404)
at
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
at
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
at
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy21.allocate(Unknown Source)
at
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:320)
at
org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:268)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:556)
Caused by:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException):
Application attempt appattempt_1570749574365_0027_000001 doesn't exist in
ApplicationMasterService cache.
at
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404)
at
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
at org.apache.hadoop.ipc.Client.call(Client.java:1443)
at org.apache.hadoop.ipc.Client.call(Client.java:1353)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy20.allocate(Unknown Source)
at
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
... 13 more
INFO [2019-10-17 09:29:25,411] ({dispatcher-event-loop-55}
Logging.scala[logInfo]:54) - MapOutputTrackerMasterEndpoint stopped!
INFO [2019-10-17 09:29:25,412] ({Reporter} Logging.scala[logInfo]:54) - Final
app status: FAILED, exitCode: 12, (reason: Application attempt
appattempt_1570749574365_0027_000001 doesn't exist in ApplicationMasterService
cache.
at
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404)
at
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
)
INFO [2019-10-17 09:29:25,439] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
MemoryStore cleared
INFO [2019-10-17 09:29:25,440] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
BlockManager stopped
INFO [2019-10-17 09:29:25,441] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
BlockManagerMaster stopped
INFO [2019-10-17 09:29:25,445] ({dispatcher-event-loop-63}
Logging.scala[logInfo]:54) - OutputCommitCoordinator stopped!
INFO [2019-10-17 09:29:25,459] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Successfully stopped SparkContext
INFO [2019-10-17 09:29:25,461] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Shutdown hook called
INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Deleting directory
/d0/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-8328975e-fd10-4676-9b51-c800253c1016/pyspark-55a3ee94-1620-423b-8564-a0f2a7b1da76
INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Deleting directory
/d0/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-8328975e-fd10-4676-9b51-c800253c1016
INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Deleting directory
/d1/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-f0f652c9-2a8a-4235-818d-9f9165890c52
End of LogType:stdout
Exception 3:
INFO [2019-10-17 09:29:28,077] ({main} Logging.scala[logInfo]:54) - Registered
signal handler for TERM
INFO [2019-10-17 09:29:28,079] ({main} Logging.scala[logInfo]:54) - Registered
signal handler for HUP
INFO [2019-10-17 09:29:28,080] ({main} Logging.scala[logInfo]:54) - Registered
signal handler for INT
INFO [2019-10-17 09:29:28,193] ({main} Logging.scala[logInfo]:54) - Changing
view acls to: mansop
INFO [2019-10-17 09:29:28,193] ({main} Logging.scala[logInfo]:54) - Changing
modify acls to: mansop
INFO [2019-10-17 09:29:28,194] ({main} Logging.scala[logInfo]:54) - Changing
view acls groups to:
INFO [2019-10-17 09:29:28,194] ({main} Logging.scala[logInfo]:54) - Changing
modify acls groups to:
INFO [2019-10-17 09:29:28,195] ({main} Logging.scala[logInfo]:54) -
SecurityManager: authentication disabled; ui acls disabled; users with view
permissions: Set(mansop); groups with view permissions: Set(); users with
modify permissions: Set(mansop); groups with modify permissions: Set()
INFO [2019-10-17 09:29:28,493] ({main} Logging.scala[logInfo]:54) - Preparing
Local resources
INFO [2019-10-17 09:29:29,195] ({main} Logging.scala[logInfo]:54) -
ApplicationAttemptId: appattempt_1570749574365_0027_000002
INFO [2019-10-17 09:29:29,219] ({main} Logging.scala[logInfo]:54) - Starting
the user application in a separate Thread
INFO [2019-10-17 09:29:29,241] ({main} Logging.scala[logInfo]:54) - Waiting for
spark context initialization...
INFO [2019-10-17 09:29:29,243] ({Driver}
RemoteInterpreterServer.java[main]:260) -
URL:jar:file:/d1/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/container_e15_1570749574365_0027_02_000001/__app__.jar!/org/apache/zeppelin/interpreter/remote/RemoteInterpreterServer.class
INFO [2019-10-17 09:29:29,311] ({Driver}
RemoteInterpreterServer.java[<init>]:161) - Launching ThriftServer at
10.0.1.12:41438
INFO [2019-10-17 09:29:29,319] ({Driver}
RemoteInterpreterServer.java[<init>]:165) - Starting remote interpreter server
on port 41438
INFO [2019-10-17 09:29:29,322] ({Thread-8}
RemoteInterpreterServer.java[run]:202) - Starting remote interpreter server on
port 41438
INFO [2019-10-17 09:29:30,340] ({Thread-9}
RemoteInterpreterUtils.java[registerInterpreter]:165) - callbackHost:
10.0.11.18, callbackPort: 42835, callbackInfo: CallbackInfo(host:10.0.1.12,
port:41438)
ERROR [2019-10-17 09:31:09,266] ({main} Logging.scala[logError]:91) - Uncaught
exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000
milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498)
at
org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at
org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
at
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
at
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
INFO [2019-10-17 09:31:09,273] ({main} Logging.scala[logInfo]:54) - Final app
status: FAILED, exitCode: 13, (reason: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000
milliseconds])
INFO [2019-10-17 09:31:09,284] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Deleting staging directory
hdfs://gl-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1570749574365_0027
INFO [2019-10-17 09:31:09,299] ({shutdown-hook-0} Logging.scala[logInfo]:54) -
Shutdown hook called
End of LogType:stdout
Could someone please help me understand what these errors mean?
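In case it helps with reproducing, the full aggregated container logs for this run can be pulled with something like the following (assuming log aggregation is enabled on the cluster and the application has finished):

```shell
# Fetch the aggregated YARN logs for the failed application,
# run as the submitting user once the application has completed.
yarn logs -applicationId application_1570749574365_0027 > app_0027.log
```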
Manuel Sopena Ballesteros
Big Data Engineer | Kinghorn Centre for Clinical Genomics
[cid:[email protected]] <https://www.garvan.org.au/>
a: 384 Victoria Street, Darlinghurst NSW 2010
p: +61 2 9355 5760 | +61 4 12 123 123
e: [email protected]<mailto:[email protected]>