Hi Robert,

On 4/16/21 12:39 pm, Robert Peck wrote:

> Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs because ONE (or a few) node(s) fails?

You will probably also want to pass this option to your srun: --kill-on-bad-exit=0

What does the scontrol command below show?

scontrol show config | fgrep KillOnBadExit
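
If you only want the value itself (for example to check it from a script), something along these lines might do. It is only a sketch: the function name kill_on_bad_exit_value is made up for illustration, and it assumes the usual "Name = value" layout that scontrol show config prints.

```shell
# Read `scontrol show config` output on stdin and print only the
# KillOnBadExit value. Hypothetical helper name, not part of Slurm.
kill_on_bad_exit_value() {
    # Match the "KillOnBadExit = <value>" line and keep the value only.
    sed -n 's/^ *KillOnBadExit *= *\([^ ]*\).*$/\1/p'
}

# Intended usage on a machine with Slurm installed:
#   scontrol show config | kill_on_bad_exit_value
```

A value of 0 would mean the cluster default is not to kill the step when a task exits non-zero; anything else means you need to override it on the srun command line.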

From the manual page:

       -K, --kill-on-bad-exit[=0|1]
              Controls whether or not to terminate a step if any task
              exits with a non-zero exit code. If this option is not
              specified, the default action will be based upon the
              Slurm configuration parameter of KillOnBadExit. If this
              option is specified, it will take precedence over
              KillOnBadExit. An option argument of zero will not
              terminate the job. A non-zero argument or no argument
              will terminate the job. Note: This option takes
              precedence over the -W, --wait option to terminate the
              job immediately if a task exits with a non-zero exit
              code. Since this option's argument is optional, for
              proper parsing the single letter option must be followed
              immediately with the value and not include a space between
              them. For example "-K1" and not "-K 1".
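
To make the two spellings concrete, here is a rough batch-script sketch. The task count and the program name ./my_task are placeholders I have assumed, not anything from your setup, and the check for srun is only there so the snippet does nothing harmful if run outside a cluster.

```shell
#!/bin/sh
# Assumption: four tasks, purely for illustration.
#SBATCH --ntasks=4

# Long form: the value is attached with "=".
LONG_OPT="--kill-on-bad-exit=0"
# Short form: the value must follow -K directly ("-K0", not "-K 0").
SHORT_OPT="-K0"

# ./my_task is a hypothetical program; launch it only where srun exists.
if command -v srun >/dev/null 2>&1; then
    # With the value 0, one task exiting non-zero should not terminate the step.
    srun "$LONG_OPT" ./my_task
fi
```

Either $LONG_OPT or $SHORT_OPT should behave the same according to the manual page text above; the long form is just harder to get wrong.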


Best of luck,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
