magnuma3 opened a new pull request, #8474:
URL: https://github.com/apache/hadoop/pull/8474

   
   ### Description of PR
   
   When a user container exits with code 22 or 24, the NodeManager becomes 
unhealthy and no more containers are allocated to that node. This situation can 
be resolved by restarting the NodeManager.
    
   The problem can be reproduced immediately by running a Scala Spark wordcount 
job that exits with code 22.
    
   I propose to fix this by remapping a user container's exit code of 22 or 24 
to a different exit code, so that the ConfigurationException that causes the 
NodeManager to become unhealthy is not triggered.
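
The remapping idea can be sketched as follows. This is a minimal illustration, not the code in this patch: the class name, the method names, and the replacement value 122 are all hypothetical, and `wouldMarkNodeUnhealthy` is only a stand-in for the check in `LinuxContainerExecutor.handleExitCode` that the stack trace shows throwing the ConfigurationException. It assumes the executor treats 22 and 24 as its own fatal codes, which is why a user process returning them is misread.

```java
/** Sketch only: every name and the replacement value here are hypothetical. */
final class ExitCodeRemapSketch {

    // Hypothetical replacement value; the actual patch may choose another.
    static final int REMAPPED_USER_EXIT_CODE = 122;

    /** True for the two codes this PR names as making the node unhealthy. */
    static boolean isReservedCode(int exitCode) {
        return exitCode == 22 || exitCode == 24;
    }

    /** Remaps a user container's exit code if it collides with a reserved one. */
    static int remapUserExitCode(int userExitCode) {
        return isReservedCode(userExitCode)
            ? REMAPPED_USER_EXIT_CODE
            : userExitCode;
    }

    /** Stand-in for the check that leads to ConfigurationException. */
    static boolean wouldMarkNodeUnhealthy(int reportedExitCode) {
        return isReservedCode(reportedExitCode);
    }

    public static void main(String[] args) {
        // Print how each user exit code would be reported after remapping.
        for (int code : new int[] {0, 1, 22, 24}) {
            int reported = remapUserExitCode(code);
            System.out.printf("user exit %d -> reported %d, unhealthy=%b%n",
                code, reported, wouldMarkNodeUnhealthy(reported));
        }
    }
}
```

One trade-off of this approach is that the original user exit code is lost once it is remapped, so logging it before the substitution would keep it available for debugging.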
    
   ```
   2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Obtaining the exit code...
   2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Docker inspect command: /usr/bin/docker inspect --format {{.State.ExitCode}} container_e161_1711009858797_8304894_01_000015
   2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exit code from docker inspect: 22
   2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Wrote the exit code 22 to /data6/hadoop/yarn/local/nmPrivate/application_1711009858797_8304894/container_e161_1711009858797_8304894_01_000015/container_e161_1711009858797_8304894_01_000015.pid.exitcode
   2024-09-23 18:50:14,381 ERROR launcher.ContainerLaunch (ContainerLaunch.java:call(340)) - Failed to launch container due to configuration error.
   org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
           at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:615)
           at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
           at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
           at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:513)
           at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:323)
           at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:106)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:745)
   Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
           at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:1099)
           at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:166)
           at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:564)
           ... 8 more
   ```
   
   ### How was this patch tested?
   
   The problem can be reproduced immediately by running a Scala Spark wordcount 
job that exits with code 22.
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   ### AI Tooling
   
   If an AI tool was used:
   
   - [ ] The PR includes the phrase "Contains content generated by <tool>"
         where <tool> is the name of the AI tool used.
   - [ ] My use of AI contributions follows the ASF legal policy
         https://www.apache.org/legal/generative-tooling.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

