[
https://issues.apache.org/jira/browse/HADOOP-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450470#comment-16450470
]
Miklos Szegedi commented on HADOOP-15372:
-----------------------------------------
Thank you for the patch [~ebadger].
{code:java}
shell.getProcess().destroy();{code}
I see that you use SIGTERM instead of SIGKILL. This may still leak child
processes.
{code:java}
destroyShellProcesses(getAllShells());
if(!exec.awaitTermination(10, TimeUnit.SECONDS)) {
destroyShellProcesses(getAllShells());
}{code}
This may still leak shells, if the creator thread is waiting itself on
something else and it creates the subshell, once the second {{getAllShells()}}
is called.
I have a general concern with the original patch that any other users of the
Shell class may need to change their code to use the same pattern. There are at
least 5 other users of Shell in the codebase.
Have you considered using process groups? Process groups may also solve the
case when the node manager signals the container localizer process with
SIGKILL. Currently, if this happens the shells will leak.
> Race conditions and possible leaks in the Shell class
> -----------------------------------------------------
>
> Key: HADOOP-15372
> URL: https://issues.apache.org/jira/browse/HADOOP-15372
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.10.0, 3.2.0
> Reporter: Miklos Szegedi
> Assignee: Eric Badger
> Priority: Minor
> Attachments: HADOOP-15372.001.patch
>
>
> YARN-5641 introduced some cleanup code in the Shell class. It has a race
> condition. {{Shell.
> runCommand()}} can be called while/after {{Shell.getAllShells()}} returned
> all the shells to be cleaned up. This new thread can avoid the clean up, so
> that the process held by it can be leaked causing leaked localized files/etc.
> I see another issue as well. {{Shell.runCommand()}} has a finally block with
> a {{
> process.destroy();}} to clean up. However, the try catch block does not cover
> all instructions after the process is started, so for example we can exit the
> thread and leak the process, if {{
> timeOutTimer.schedule(timeoutTimerTask, timeOutInterval);}} causes an
> exception.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]