cnauroth commented on code in PR #7857:
URL: https://github.com/apache/hadoop/pull/7857#discussion_r2261601292
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuDiscoverer.java:
##########
@@ -297,6 +297,36 @@ public void testGetGpuDeviceInformationFaultyNvidiaSmiScriptConsecutiveRun()
assertNotNull(discoverer.getGpusUsableByYarn());
}
+ @Test
+ public void testGetGpuDeviceInformationDisableMaxErrors()
+ throws YarnException, IOException {
+ Configuration conf = new Configuration(false);
+ // A negative value should disable max errors enforcement.
+ conf.setInt(YarnConfiguration.NM_GPU_DISCOVERY_MAX_ERRORS, -1);
+
+ File fakeBinary = createFakeNvidiaSmiScriptAsRunnableFile(
+ this::createFaultyNvidiaSmiScript);
+
+ GpuDiscoverer discoverer = creatediscovererWithGpuPathDefined(conf);
+ assertEquals(fakeBinary.getAbsolutePath(),
+ discoverer.getPathOfGpuBinary());
+ assertNull(discoverer.getEnvironmentToRunCommand().get(PATH));
+
+ final String terminateMsg = "Failed to execute GPU device " +
+ "detection script (" + fakeBinary.getAbsolutePath() + ") for 10 times";
+ final String msg = "Failed to execute GPU device detection script";
+
+ // The default max errors is 10. Verify that it keeps going for an 11th try.
+ for (int i = 0; i < 11; ++i) {
Review Comment:
This test covers the case where max errors enforcement is disabled by setting a
negative value. To make that clearer, I raised the loop to 20 attempts, and I also
added another test that sets the configuration to 11 and confirms it tries
exactly 11 times.
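
For readers following along, here is a minimal sketch of the semantics those two tests exercise. It is not the actual `GpuDiscoverer` code: `MaxErrorsPolicy`, `recordFailure`, and `attemptsBeforeGivingUp` are hypothetical names, and every attempt is assumed to fail, as with the faulty script in the test.

```java
/**
 * Hypothetical sketch of a max-errors policy. This is NOT the GpuDiscoverer
 * implementation; it only illustrates the behavior the tests describe.
 */
final class MaxErrorsPolicy {
  private final int maxErrors;
  private int errors;

  MaxErrorsPolicy(int maxErrors) {
    this.maxErrors = maxErrors;
  }

  /** Records one failed run and returns true while another attempt is allowed. */
  boolean recordFailure() {
    errors++;
    // Assumption taken from the test comment: a negative value disables the limit.
    return maxErrors < 0 || errors < maxErrors;
  }

  /** Counts attempts made before the policy gives up, stopping at cap. */
  static int attemptsBeforeGivingUp(int maxErrors, int cap) {
    MaxErrorsPolicy policy = new MaxErrorsPolicy(maxErrors);
    int attempts = 0;
    while (attempts < cap) {
      attempts++; // every attempt is assumed to fail in this sketch
      if (!policy.recordFailure()) {
        break;
      }
    }
    return attempts;
  }
}
```

With a limit of -1 and a cap of 20, the sketch makes all 20 attempts. With a limit of 11, it stops after exactly 11.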
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml:
##########
@@ -4650,6 +4650,34 @@
<value></value>
</property>
+ <property>
+ <description>
+ Sets the maximum duration for executions of the discovery binary defined in
+ yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables. If
+ the binary takes longer than this amount of time to run, then the process
+ is aborted. Discovery may be attempted again, depending on
+ yarn.nodemanager.resource-plugins.gpu.discovery-max-errors.
+ </description>
+ <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
+ <value>10000ms</value>
Review Comment:
Sounds good, updated.
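
As a rough illustration of the timeout behavior the description documents, the sketch below bounds a task with a deadline and aborts it when the deadline passes. It rests on assumptions: the discovery run is modeled as a `Callable` rather than an external process, and `TimedDiscoverySketch`, `Outcome`, and `runWithTimeout` are hypothetical names, not part of the patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: bound a discovery run by a timeout and abort it on expiry. */
final class TimedDiscoverySketch {
  enum Outcome { COMPLETED, TIMED_OUT, FAILED }

  static Outcome runWithTimeout(Callable<?> task, long timeoutMs) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> future = executor.submit(task);
    try {
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return Outcome.COMPLETED;
    } catch (TimeoutException e) {
      // Deadline passed: interrupt the task. A real implementation that
      // launches an external binary would need to kill the process instead.
      future.cancel(true);
      return Outcome.TIMED_OUT;
    } catch (ExecutionException e) {
      // The task itself threw before the deadline.
      return Outcome.FAILED;
    } catch (InterruptedException e) {
      // The calling thread was interrupted while waiting; restore the flag.
      Thread.currentThread().interrupt();
      future.cancel(true);
      return Outcome.FAILED;
    } finally {
      executor.shutdownNow();
    }
  }
}
```

A caller could treat both `TIMED_OUT` and `FAILED` as one error toward the discovery-max-errors limit. That pairing is an assumption drawn from the property description, not something this sketch verifies.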
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]