Hi Robert,

On 4/16/21 12:39 pm, Robert Peck wrote:

> Please can anyone suggest how to instruct SLURM not to massacre ALL my
> jobs because ONE (or a few) node(s) fails?

You will also probably want this for your srun:

    --kill-on-bad-exit=0

What does the scontrol command below show?

    scontrol show config | fgrep KillOnBadExit

From the manual page:

    -K, --kill-on-bad-exit[=0|1]
        Controls whether or not to terminate a step if any task exits
        with a non-zero exit code. If this option is not specified, the
        default action will be based upon the Slurm configuration
        parameter of KillOnBadExit. If this option is specified, it
        will take precedence over KillOnBadExit. An option argument of
        zero will not terminate the job. A non-zero argument or no
        argument will terminate the job.

        Note: This option takes precedence over the -W, --wait option
        to terminate the job immediately if a task exits with a
        non-zero exit code.

        Since this option's argument is optional, for proper parsing
        the single letter option must be followed immediately with the
        value and not include a space between them. For example "-K1"
        and not "-K 1".

Best of luck,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
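
P.S. In case it helps, here is a minimal sketch of how the option might sit in a batch script. The payload name my_task.sh, the PAYLOAD and SRUN_OPTS variables, and the --ntasks=4 request are hypothetical placeholders, not anything taken from your setup. The guard around srun is only there so the script can be dry-run on a machine without Slurm installed.

```shell
#!/bin/bash
#SBATCH --ntasks=4

# Hypothetical payload; substitute your own program.
PAYLOAD="${PAYLOAD:-./my_task.sh}"

# Keep the step running when one task exits non-zero.
# The short form must be written -K0, with no space before the value.
SRUN_OPTS="--kill-on-bad-exit=0"

if command -v srun >/dev/null 2>&1; then
    # Show the cluster-wide default that the srun option overrides.
    scontrol show config | grep -F KillOnBadExit
    srun $SRUN_OPTS "$PAYLOAD"
else
    # Dry run on a machine without Slurm installed.
    echo "would run: srun $SRUN_OPTS $PAYLOAD"
fi
```

Note that this only covers tasks exiting with a non-zero code; what the scontrol output says about your site's KillOnBadExit default is still worth checking first.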