Hi,
I think I have run into a possible deadlock. I am not sure whether it is
actually a deadlock or not :-)
Here is my scenario:
Run a job and call JobClient.monitorAndPrintJob to monitor it and get
status updates.
In parallel, from another thread, invoke JobClient$NetworkedJob.killJob.
For reference, I am attaching the thread dumps for both operations:
"MrPlanRunner" daemon prio=5 tid=7fe12cacf000 nid=0x11352f000 in Object.wait() [11352d000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f3c55668> (a org.apache.hadoop.ipc.Client$Call)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.ipc.Client.call(Client.java:1145)
    - locked <7f3c55668> (a org.apache.hadoop.ipc.Client$Call)
    at org.apache.hadoop.ipc.Client.call(Client.java:1122)
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:148)
    at $Proxy40.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:116)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:343)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:143)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:296)
    - locked <7f4d78950> (a org.apache.hadoop.mapred.ClientServiceDelegate)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:373)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:483)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:322)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:319)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
    at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:319)
    - locked <7f4f70fc0> (a org.apache.hadoop.mapreduce.Job)
    at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:598)
    at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1280)
    at org.apache.hadoop.mapred.JobClient$NetworkedJob.monitorAndPrintJob(JobClient.java:432)
    at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:902)
    at xxxxx.runJob(xxxxx.java:74)
    at xxxxx.doExecute(xxxxx.java:39)
    at xxxxx.doExecute(xxxxx.java:1)
    at xxxxexecute(xxxxxx.java:29)
    at xxxx.MrPlanRunner.run(xxxxx.java:117)
    at java.lang.Thread.run(Thread.java:680)
"Thread-2" prio=5 tid=7fe12e2de800 nid=0x114d15000 waiting for monitor entry [114d13000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:286)
    - waiting to lock <7f4d78950> (a org.apache.hadoop.mapred.ClientServiceDelegate)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:373)
    at org.apache.hadoop.mapred.YARNRunner.killJob(YARNRunner.java:509)
    at org.apache.hadoop.mapreduce.Job.killJob(Job.java:622)
    at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:319)
    - locked <7f4f8fa68> (a org.apache.hadoop.mapred.JobClient$NetworkedJob)
    at xxxx.cancelCurrentJob(xxxxxx.java:150)
    at xxxx.cancel(xxxxx.java:171)
    at xxxx.testCancelJob(xxxx.java:135)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
    at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:62)
In the thread dump we can observe that the object <7f4d78950> (a
ClientServiceDelegate) is locked by the MrPlanRunner thread (the thread calling
JobClient.monitorAndPrintJob). That thread is itself waiting inside Client.call
for an RPC response while it holds the lock. Thread-2 (the thread calling
JobClient$NetworkedJob.killJob) tries to lock the same object and is BLOCKED.
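To make the pattern concrete, here is a minimal plain-Java sketch of what I think the dump shows. It is not Hadoop code: Delegate, getStatus and kill are hypothetical stand-ins for ClientServiceDelegate and the two call paths, and a CountDownLatch plays the role of the RPC reply that has not arrived yet.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LockContentionSketch {

    // Hypothetical stand-in for ClientServiceDelegate: one monitor guards every call.
    static class Delegate {
        private final CountDownLatch reply;
        private final CountDownLatch lockHeld;

        Delegate(CountDownLatch reply, CountDownLatch lockHeld) {
            this.reply = reply;
            this.lockHeld = lockHeld;
        }

        // Like the monitoring path: keeps the monitor while waiting for a remote reply.
        synchronized void getStatus() throws InterruptedException {
            lockHeld.countDown(); // signal that this thread now owns the monitor
            reply.await();        // stand-in for the pending RPC response
        }

        // Like the kill path: needs the same monitor before it can do anything.
        synchronized void kill() {
        }
    }

    // Returns the state of the "killer" thread observed while the "monitor" thread
    // still holds the delegate's monitor and waits for its reply.
    static Thread.State observeKillerState() {
        CountDownLatch reply = new CountDownLatch(1);
        CountDownLatch lockHeld = new CountDownLatch(1);
        Delegate delegate = new Delegate(reply, lockHeld);

        Thread monitor = new Thread(() -> {
            try {
                delegate.getStatus();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "monitor");
        Thread killer = new Thread(delegate::kill, "killer");

        try {
            monitor.start();
            lockHeld.await(); // the monitor thread owns the lock from here on
            killer.start();

            // Poll (for at most 5 seconds) until the killer is parked on the monitor.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (killer.getState() != Thread.State.BLOCKED
                    && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            Thread.State observed = killer.getState();

            reply.countDown(); // the "reply" arrives, so the monitor is released
            monitor.join();
            killer.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing threads", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("killer state while monitor waits: " + observeKillerState());
    }
}
```

In this model the second thread only gets unstuck once the reply arrives. If the reply never came, it would stay blocked forever, which is the behaviour I am worried about.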
Please let me know whether this is a possible problem in the code or whether my
usage of the API is incorrect.
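One check I could add on my side to tell a true monitor cycle apart from a thread that is merely stuck behind a slow call is the JVM's built-in deadlock detector. Below is a minimal sketch, assuming it runs inside the same JVM as the client. DeadlockCheckSketch and describeDeadlocks are hypothetical names; only the standard java.lang.management API is used. Note that this detector reports lock cycles only, so a thread waiting forever on an RPC reply would not be flagged by it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockCheckSketch {

    // Asks the JVM whether any threads are in a lock cycle and summarises the answer.
    static String describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no cycle is found
        if (ids == null) {
            return "no deadlock cycle detected";
        }
        StringBuilder sb = new StringBuilder("deadlocked threads:");
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            sb.append(' ')
              .append(info.getThreadName())
              .append(" (waiting for ")
              .append(info.getLockName())
              .append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeDeadlocks());
    }
}
```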
The build being used is: 0.23.1-cdh4.0.0b2
Cheers,
Subroto Sanyal