Chris: thanks for that tip, I'm having a look at that now, it sounds promising.
Running this on the login node, I get:

scontrol show config | fgrep KillOnBadExit
KillOnBadExit = 0

I've tried putting -K0 into a job to see if that helps. But doing it on the command line

sbatch -K0 job_name.job

gives the error

sbatch: invalid option -- 'K'

and putting

#SBATCH --kill-on-bad-exit=0

at the top of the .job file gives the error

sbatch: unrecognized option '--kill-on-bad-exit=0'

Loris, thanks also. If Chris's tip can't solve my issues I'll post a more detailed discussion of the software I'm working with, but it can get quite confusing, and it took me a long time to get this software running on the cluster even in the "one per node" form I currently use, hence my preference to tinker with SLURM settings rather than try to change the software.

On Tue, 20 Apr 2021 at 06:53, Christopher Samuel <ch...@csamuel.org> wrote:

> Hi Robert,
>
> On 4/16/21 12:39 pm, Robert Peck wrote:
>
> > Please can anyone suggest how to instruct SLURM not to massacre ALL my
> > jobs because ONE (or a few) node(s) fails?
>
> You will also probably want this for your srun: --kill-on-bad-exit=0
>
> What does the scontrol command below show?
>
> scontrol show config | fgrep KillOnBadExit
>
> From the manual page:
>
> -K, --kill-on-bad-exit[=0|1]
>     Controls whether or not to terminate a step if any task
>     exits with a non-zero exit code. If this option is not
>     specified, the default action will be based upon
>     the Slurm configuration parameter of KillOnBadExit.
>     If this option is specified, it will take precedence over
>     KillOnBadExit. An option argument of zero will not
>     terminate the job. A non-zero argument or no argument
>     will terminate the job. Note: This option takes
>     precedence over the -W, --wait option to terminate the
>     job immediately if a task exits with a non-zero exit
>     code. Since this option's argument is optional, for
>     proper parsing the single letter option must be followed
>     immediately with the value and not include a space between
>     them. For example "-K1" and not "-K 1".
>
> Best of luck,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

--
Thanks
----------
Robert Peck
*Robot Lab*
*Intelligent Systems and Nanoscience group*
*Department of Electronic Engineering*
*University of York*
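
Note: the two sbatch errors above are consistent with -K / --kill-on-bad-exit being an srun option rather than an sbatch one (Chris's suggestion was "for your srun"). A minimal sketch of a .job file that passes the option on the srun line instead might look like the following. The job name, task count and ./my_program are hypothetical placeholders, and the guard around srun is only there so the script can be dry-run on a machine without Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=kill-on-bad-exit-demo   # hypothetical job name
#SBATCH --ntasks=4                         # hypothetical task count

# --kill-on-bad-exit belongs on the srun line, not on the sbatch command
# line or in an #SBATCH directive. An argument of 0 asks Slurm not to
# terminate the step when one task exits with a non-zero code.
STEP_OPTS="--kill-on-bad-exit=0"

if command -v srun >/dev/null 2>&1; then
    # ./my_program is a placeholder for the real executable.
    srun "$STEP_OPTS" ./my_program
else
    # Dry run: show the command that would be launched under Slurm.
    echo "srun not found; would run: srun $STEP_OPTS ./my_program"
fi
```

Submitted with a plain "sbatch job_name.job", this should avoid the option-parsing errors, since sbatch never sees the flag. Whether it also prevents the remaining tasks from being cancelled when a node fails is a separate question that this sketch does not settle.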