P.S. The Slurm version here is 20.02.3.

On Tue, 20 Apr 2021 at 16:55, Robert Peck <rp1...@york.ac.uk> wrote:
> Chris: thanks for that tip, I'm having a look at that now, it sounds
> promising.
>
> Run on the login node I get:
> scontrol show config | fgrep KillOnBadExit
> KillOnBadExit = 0
>
> I've tried to put -K0 into a job to see if that helps.
> But doing it on the command line
> sbatch -K0 job_name.job
> gives an error
> sbatch: invalid option -- 'K'
> and putting
> #SBATCH --kill-on-bad-exit=0
> in the top of the .job file gives the error
> sbatch: unrecognized option '--kill-on-bad-exit=0'
>
> Loris, thanks also. If Chris's tip can't solve my issues I'll post more
> detailed discussions of the software I'm working with, but it can get quite
> confusing and took me a long time to get this software running on the
> cluster even in the "one per node" form I currently use it, hence my
> preference to tinker with SLURM settings rather than try to change the
> software.
>
> On Tue, 20 Apr 2021 at 06:53, Christopher Samuel <ch...@csamuel.org>
> wrote:
>
>> Hi Robert,
>>
>> On 4/16/21 12:39 pm, Robert Peck wrote:
>>
>> > Please can anyone suggest how to instruct SLURM not to massacre ALL my
>> > jobs because ONE (or a few) node(s) fails?
>>
>> You will also probably want this for your srun: --kill-on-bad-exit=0
>>
>> What does the scontrol command below show?
>>
>> scontrol show config | fgrep KillOnBadExit
>>
>> From the manual page:
>>
>> -K, --kill-on-bad-exit[=0|1]
>>     Controls whether or not to terminate a step if any task
>>     exits with a non-zero exit code. If this option is not
>>     specified, the default action will be based upon
>>     the Slurm configuration parameter of KillOnBadExit.
>>     If this option is specified, it will take precedence over
>>     KillOnBadExit. An option argument of zero will not
>>     terminate the job. A non-zero argument or no argument
>>     will terminate the job. Note: This option takes
>>     precedence over the -W, --wait option to terminate the
>>     job immediately if a task exits with a non-zero exit
>>     code.
>>     Since this option's argument is optional, for
>>     proper parsing the single letter option must be followed
>>     immediately with the value and not include a space between
>>     them. For example "-K1" and not "-K 1".
>>
>>
>> Best of luck,
>> Chris
>> --
>> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

--
Thanks

----------

Robert Peck

*Robot Lab*
*Intelligent Systems and Nanoscience group*
*Department of Electronic Engineering*
*University of York*
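
Both sbatch errors quoted in this thread are consistent with the manual page excerpt: -K / --kill-on-bad-exit is documented there as an srun option, so it would go on the srun line inside the batch script rather than on the sbatch command line or in an #SBATCH directive. A minimal sketch, assuming the job launches its tasks with srun; the file name example_k0.job, the task count, and ./my_program are hypothetical placeholders, not taken from the thread:

```shell
#!/bin/bash
# Sketch only: writes a batch script in which --kill-on-bad-exit=0 is given
# to srun (the step launcher) instead of to sbatch.
# "example_k0.job", "--ntasks=4" and "./my_program" are made-up placeholders.
job_file="example_k0.job"

cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH --job-name=k0-example
#SBATCH --ntasks=4

# --kill-on-bad-exit belongs on the srun line. An argument of 0 asks srun
# not to terminate the step when a task exits with a non-zero exit code.
srun --kill-on-bad-exit=0 ./my_program
EOF

echo "wrote $job_file; submit it with: sbatch $job_file"
```

The generated file would then be submitted with a plain `sbatch example_k0.job`. Note that the manual text describes tasks exiting with a non-zero code; whether this setting also helps when a node itself fails is not established in the thread.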