In my recent experience, some SSDs in our compute nodes occasionally
drop off the bus, so the compute node loses its OS disk. I haven't
looked into it closely, but the default NHC scripts do not notice that.
Similarly, Paul's proposed script might also need to check
that the s
Since you can run an arbitrary script as a node health checker, I might
add a script that counts failures and then closes the node if it hits a
threshold. The script shouldn't need to talk to slurmctld or slurmdbd,
as it should be able to watch the log on the node and see the failures.
-Paul Edmon-
On
Hello,
how do you implement something like "drain host after 10 consecutive
failed jobs"? A host check script only checks for known errors, whereas
I'd like to stop killing jobs just because one node is faulty.
Gerhard
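
For what it's worth, Paul's idea of counting failures from the log on the
node could be sketched roughly like this in Python. Treat it as a starting
point under stated assumptions, not a tested health check: the log line
format matched by JOB_END is purely hypothetical (it is NOT the real slurmd
log format, so adjust the pattern to whatever your logs actually record per
job), and the names consecutive_failures, drain_command and THRESHOLD are
made up for illustration. Reading the log needs no contact with slurmctld;
the final drain step in this sketch does go through scontrol, so swap that
for whatever mechanism your site prefers.

```python
import re
import socket
import subprocess
import sys

# ASSUMPTION: hypothetical log line format, not the real slurmd format.
# Group 1 is the job id, group 2 is the job's exit code.
JOB_END = re.compile(r"job_end jobid=(\d+) exit_code=(\d+)")

# "10 consecutive failed jobs", as in the original question.
THRESHOLD = 10


def consecutive_failures(lines, pattern=JOB_END):
    """Count the failed jobs seen since the most recent successful job."""
    streak = 0
    for line in lines:
        match = pattern.search(line)
        if match is None:
            continue  # not a job-completion line
        if int(match.group(2)) == 0:
            streak = 0  # a success resets the streak
        else:
            streak += 1
    return streak


def drain_command(node, reason):
    """Build the scontrol invocation that drains a node."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def main(log_path):
    with open(log_path, errors="replace") as fh:
        streak = consecutive_failures(fh)
    if streak >= THRESHOLD:
        # Assumes the short hostname matches the Slurm NodeName.
        node = socket.gethostname().split(".")[0]
        reason = f"{streak} consecutive failed jobs"
        subprocess.run(drain_command(node, reason), check=False)
        return 1
    return 0


# Only runs when called with exactly one argument: the log file path.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

This rereads the whole log on every run, which is simple but will get slow
on large logs; remembering the file offset between runs would be the
obvious refinement. It also resets the count on any successful job, so it
only fires on failures that are truly consecutive.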