[ 
https://issues.apache.org/jira/browse/HADOOP-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003987#comment-14003987
 ] 

Jason Lowe commented on HADOOP-10622:
-------------------------------------

Saw this while running the TestNodeManagerResync.testKillContainersOnResync 
unit test, although the nature of the deadlock looks like it could happen in 
other scenarios as well.

{noformat}
Found one Java-level deadlock:
=============================
"Thread-163":
  waiting to lock monitor 0x00007f4e38086b60 (object 0x00000000ebab1508, a java.lang.UNIXProcess$ProcessPipeInputStream),
  which is held by "LocalizerRunner for container_0_0000_01_000000"
"LocalizerRunner for container_0_0000_01_000000":
  waiting to lock monitor 0x00007f4e380855b8 (object 0x00000000ebab3620, a java.io.InputStreamReader),
  which is held by "Thread-163"

Java stack information for the threads listed above:
===================================================
"Thread-163":
        at java.io.BufferedInputStream.read(BufferedInputStream.java:325)
        - waiting to lock <0x00000000ebab1508> (a java.lang.UNIXProcess$ProcessPipeInputStream)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
        - locked <0x00000000ebab3620> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:154)
        at java.io.BufferedReader.readLine(BufferedReader.java:317)
        - locked <0x00000000ebab3620> (a java.io.InputStreamReader)
        at java.io.BufferedReader.readLine(BufferedReader.java:382)
        at org.apache.hadoop.util.Shell$1.run(Shell.java:506)
"LocalizerRunner for container_0_0000_01_000000":
        at java.io.BufferedReader.close(BufferedReader.java:515)
        - waiting to lock <0x00000000ebab3620> (a java.io.InputStreamReader)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:574)
        - locked <0x00000000ebab1508> (a java.lang.UNIXProcess$ProcessPipeInputStream)
        at org.apache.hadoop.util.Shell.run(Shell.java:452)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:684)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:773)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:756)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:288)
        at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1012)
        at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
        at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
        at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
        at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
        at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:666)
        at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:662)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext.create(FileContext.java:662)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1105)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1068)

Found 1 deadlock.
{noformat}

Shell.runCommand holds the lock on the stderr InputStream and is trying to 
close() the reader wrapped around it, while the errThread it spawned earlier in 
the method holds the reader's lock and is trying to read from that same stream.  
The method tries to join with the errThread before closing, but it appears the 
join was aborted by an InterruptedException in the case where it deadlocked 
(probably because the container was being killed in the unit test).  Here's the 
relevant snippet from the unit test log showing the method being interrupted:

{noformat}
2014-05-20 20:48:40,053 INFO  [Thread-162] nodemanager.NodeManager (NodeManager.java:run(262)) - Cleaning up running containers on resync
2014-05-20 20:48:40,053 INFO  [Thread-162] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanupContainersOnNMResync(376)) - Containers still running on ON_NODEMANAGER_RESYNC : [container_0_0000_01_000000]
2014-05-20 20:48:40,053 INFO  [Thread-162] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanupContainersOnNMResync(383)) - Waiting for containers to be killed
2014-05-20 20:48:40,054 INFO  [AsyncDispatcher event handler] container.Container (ContainerImpl.java:handle(901)) - Container container_0_0000_01_000000 transitioned from LOCALIZING to KILLING
2014-05-20 20:48:40,057 WARN  [LocalizerRunner for container_0_0000_01_000000] util.Shell (Shell.java:runCommand(533)) - Interrupted while reading the error stream
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1260)
        at java.lang.Thread.join(Thread.java:1334)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:531)
        at org.apache.hadoop.util.Shell.run(Shell.java:452)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:684)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:773)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:756)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:288)
        at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1012)
        at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
        at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
        at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
        at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
        at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:666)
        at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:662)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext.create(FileContext.java:662)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1105)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1068)
{noformat}

It looks like we need to either be a little more persistent in trying to join 
with the errThread before entering the finally block where we lock and try to 
close the input stream, or we need to rethink the locking scheme that was added 
in HADOOP-10146.
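
To make the first option concrete, here is a rough sketch of what a "more 
persistent" join could look like.  This is only an illustration and not a 
proposed patch: the class name PersistentJoin and the method 
joinUninterruptibly are hypothetical and do not exist in Shell.java.  The idea 
is to keep retrying Thread.join() when interrupted, remember that an interrupt 
happened, and restore the interrupt status once the reader thread has finished.

```java
// Hypothetical helper, not part of Hadoop: keeps waiting for a thread to
// finish even if the calling thread is interrupted while waiting.
final class PersistentJoin {
    private PersistentJoin() {}

    /**
     * Joins with the given thread, retrying after every InterruptedException.
     * Returns true if at least one interrupt was observed while waiting.
     * The caller's interrupt status is restored before returning so that
     * code further up the stack can still react to the interrupt.
     */
    static boolean joinUninterruptibly(Thread thread) {
        boolean interrupted = false;
        while (true) {
            try {
                thread.join();
                break; // the thread has terminated
            } catch (InterruptedException e) {
                // Remember the interrupt and go back to waiting.
                interrupted = true;
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
        return interrupted;
    }
}
```

One caveat with this approach: if the reader thread is blocked reading from a 
child process that never exits, the join never returns either, so a caller 
would presumably need to destroy the process first or use a bounded wait.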

> Shell.runCommand can deadlock
> -----------------------------
>
>                 Key: HADOOP-10622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10622
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> Ran into a deadlock in Shell.runCommand.  Stacktrace details to follow.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
