[ https://issues.apache.org/jira/browse/HADOOP-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003987#comment-14003987 ]
Jason Lowe commented on HADOOP-10622:
-------------------------------------
Saw this while running the TestNodeManagerResync.testKillContainersOnResync
unit test, although the nature of the deadlock looks like it could happen in
other scenarios as well.
{noformat}
Found one Java-level deadlock:
=============================
"Thread-163":
waiting to lock monitor 0x00007f4e38086b60 (object 0x00000000ebab1508, a java.lang.UNIXProcess$ProcessPipeInputStream),
which is held by "LocalizerRunner for container_0_0000_01_000000"
"LocalizerRunner for container_0_0000_01_000000":
waiting to lock monitor 0x00007f4e380855b8 (object 0x00000000ebab3620, a java.io.InputStreamReader),
which is held by "Thread-163"
Java stack information for the threads listed above:
===================================================
"Thread-163":
at java.io.BufferedInputStream.read(BufferedInputStream.java:325)
- waiting to lock <0x00000000ebab1508> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0x00000000ebab3620> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
- locked <0x00000000ebab3620> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at org.apache.hadoop.util.Shell$1.run(Shell.java:506)
"LocalizerRunner for container_0_0000_01_000000":
at java.io.BufferedReader.close(BufferedReader.java:515)
- waiting to lock <0x00000000ebab3620> (a java.io.InputStreamReader)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:574)
- locked <0x00000000ebab1508> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at org.apache.hadoop.util.Shell.run(Shell.java:452)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:684)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:773)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:756)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:288)
at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1012)
at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:666)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:662)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.create(FileContext.java:662)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1105)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1068)
Found 1 deadlock.
{noformat}
The stack traces show a lock-order inversion. Shell.runCommand holds the lock
on the process's stderr InputStream and is calling close() on the reader
wrapped around it, which needs the InputStreamReader lock. Meanwhile the
errThread it spawned earlier in the method holds the InputStreamReader lock
inside readLine() and is waiting for the InputStream lock so it can read from
the same stream. The method tries to join with the errThread before closing,
but in the deadlocked case the join appears to have been aborted by an
InterruptedException (probably because the container was being killed in the
unit test). Here's the relevant snippet from the unit test log showing the
method being interrupted:
{noformat}
2014-05-20 20:48:40,053 INFO [Thread-162] nodemanager.NodeManager (NodeManager.java:run(262)) - Cleaning up running containers on resync
2014-05-20 20:48:40,053 INFO [Thread-162] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanupContainersOnNMResync(376)) - Containers still running on ON_NODEMANAGER_RESYNC : [container_0_0000_01_000000]
2014-05-20 20:48:40,053 INFO [Thread-162] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanupContainersOnNMResync(383)) - Waiting for containers to be killed
2014-05-20 20:48:40,054 INFO [AsyncDispatcher event handler] container.Container (ContainerImpl.java:handle(901)) - Container container_0_0000_01_000000 transitioned from LOCALIZING to KILLING
2014-05-20 20:48:40,057 WARN [LocalizerRunner for container_0_0000_01_000000] util.Shell (Shell.java:runCommand(533)) - Interrupted while reading the error stream
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1260)
at java.lang.Thread.join(Thread.java:1334)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:531)
at org.apache.hadoop.util.Shell.run(Shell.java:452)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:684)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:773)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:756)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:288)
at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1012)
at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:666)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:662)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.create(FileContext.java:662)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1105)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1068)
{noformat}
It looks like we need to either be a little more persistent in trying to join
with the errThread before entering the finally block where we lock and try to
close the input stream, or we need to rethink the locking scheme that was added
in HADOOP-10146.
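As a rough illustration of the "more persistent join" option, here is a
minimal sketch. It is not the actual Shell.runCommand code or a proposed patch:
the class UninterruptibleJoin and the method joinUninterruptibly are
hypothetical names invented for this example. The idea is to keep waiting for
the reader thread even if the caller is interrupted, and to restore the
interrupt flag afterwards so callers can still see it. It assumes the reader
thread eventually hits end of stream; if it never does, the wait is unbounded.

```java
// Hypothetical helper, not part of Hadoop: sketches a join that survives
// interrupts so the stream is only closed after the reader thread has exited.
final class UninterruptibleJoin {
    private UninterruptibleJoin() {}

    /**
     * Waits for t to terminate even if the calling thread is interrupted.
     * Returns true if an interrupt arrived while waiting; in that case the
     * caller's interrupt flag is set again before returning.
     */
    static boolean joinUninterruptibly(Thread t) {
        boolean interrupted = false;
        while (true) {
            try {
                t.join();
                break;
            } catch (InterruptedException e) {
                // Remember the interrupt and keep waiting for the thread.
                interrupted = true;
            }
        }
        if (interrupted) {
            // Restore the flag so the interrupt is not silently swallowed.
            Thread.currentThread().interrupt();
        }
        return interrupted;
    }

    public static void main(String[] args) {
        // Stand-in for the errThread: it just sleeps briefly, then exits.
        Thread reader = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException ignored) {
                // Exit early if interrupted.
            }
        }, "errThread-stand-in");
        reader.start();

        // Simulate the container kill interrupting the calling thread.
        Thread.currentThread().interrupt();
        boolean wasInterrupted = joinUninterruptibly(reader);
        System.out.println("interrupted=" + wasInterrupted
            + ", readerAlive=" + reader.isAlive());
    }
}
```

With something along these lines, the close() in the finally block would only
run once the reader thread has finished, so the two threads would no longer
contend for the stream locks. The trade-off is that an interrupted caller is
delayed until the reader finishes, which may argue for the other option of
revisiting the locking scheme instead.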
> Shell.runCommand can deadlock
> -----------------------------
>
> Key: HADOOP-10622
> URL: https://issues.apache.org/jira/browse/HADOOP-10622
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Priority: Critical
>
> Ran into a deadlock in Shell.runCommand. Stacktrace details to follow.