Submission did succeed when I put -K0 inside the srun command within my job script. It will be a while before my job runs, though, so I won't know for a little while whether the KillOnBadExit flag has helped.
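For anyone following along, here is a minimal sketch of the placement that was accepted at submission: the flag goes on the srun line inside the job script, not on sbatch or in an #SBATCH directive. The file name example_k0.job, the program ./my_program and the node counts are placeholders for illustration, not my real job.

```shell
#!/bin/bash
# Write a hypothetical job script to disk. Everything named here
# (example_k0.job, ./my_program, the node counts) is a placeholder.
cat > example_k0.job <<'EOF'
#!/bin/bash
#SBATCH --job-name=k0-example
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

# -K / --kill-on-bad-exit is an srun option, so it goes on the srun line.
# The short form takes its value with no space: -K0, not "-K 0".
srun -K0 ./my_program
EOF

# Submit only when sbatch is actually available on this machine.
if command -v sbatch >/dev/null 2>&1; then
    sbatch example_k0.job
else
    echo "sbatch not found; wrote example_k0.job without submitting"
fi
```

The long form, srun --kill-on-bad-exit=0 ./my_program, should be equivalent according to the manual page quoted below. Whether this actually stops the other tasks being killed is still untested at this point.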
On Tue, 20 Apr 2021 at 16:57, Robert Peck <rp1...@york.ac.uk> wrote:

> P.S. the slurm version here is 20.02.3
>
> On Tue, 20 Apr 2021 at 16:55, Robert Peck <rp1...@york.ac.uk> wrote:
>
>> Chris: thanks for that tip, I'm having a look at that now, it sounds
>> promising.
>>
>> Run on the login node I get:
>> scontrol show config | fgrep KillOnBadExit
>> KillOnBadExit = 0
>>
>> I've tried to put -K0 in to a job to see if that helps.
>> But doing it on the command line
>> sbatch -K0 job_name.job
>> gives an error
>> sbatch: invalid option -- 'K'
>> and putting
>> #SBATCH --kill-on-bad-exit=0
>> in the top of the .job file gives the error
>> sbatch: unrecognized option '--kill-on-bad-exit=0'
>>
>> Loris, thanks also. If Chris's tip can't solve my issues I'll post more
>> detailed discussions of the software I'm working with, but it can get quite
>> confusing and took me a long time to get this software running on the
>> cluster even in the "one per node" form I currently use it, hence my
>> preference to tinker with SLURM settings rather than try to change the
>> software.
>>
>> On Tue, 20 Apr 2021 at 06:53, Christopher Samuel <ch...@csamuel.org>
>> wrote:
>>
>>> Hi Robert,
>>>
>>> On 4/16/21 12:39 pm, Robert Peck wrote:
>>>
>>> > Please can anyone suggest how to instruct SLURM not to massacre ALL my
>>> > jobs because ONE (or a few) node(s) fails?
>>>
>>> You will also probably want this for your srun: --kill-on-bad-exit=0
>>>
>>> What does the scontrol command below show?
>>>
>>> scontrol show config | fgrep KillOnBadExit
>>>
>>> From the manual page:
>>>
>>> -K, --kill-on-bad-exit[=0|1]
>>>     Controls whether or not to terminate a step if any task
>>>     exits with a non-zero exit code. If this option is not
>>>     specified, the default action will be based upon
>>>     the Slurm configuration parameter of KillOnBadExit.
>>>     If this option is specified, it will take precedence over
>>>     KillOnBadExit. An option argument of zero will not
>>>     terminate the job. A non-zero argument or no argument
>>>     will terminate the job. Note: This option takes
>>>     precedence over the -W, --wait option to terminate the
>>>     job immediately if a task exits with a non-zero exit
>>>     code. Since this option's argument is optional, for
>>>     proper parsing the single letter option must be followed
>>>     immediately with the value and not include a space between
>>>     them. For example "-K1" and not "-K 1".
>>>
>>> Best of luck,
>>> Chris
>>> --
>>> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
>>
>> --
>> Thanks
>>
>> ----------
>>
>> Robert Peck
>>
>> *Robot Lab*
>> *Intelligent Systems and Nanoscience group*
>> *Department of Electronic Engineering*
>> *University of York*

--
Thanks

----------

Robert Peck

*Robot Lab*
*Intelligent Systems and Nanoscience group*
*Department of Electronic Engineering*
*University of York*