Re: [slurm-users] NHC and slurm

2021-04-20 Thread Michael Jennings
On Thursday, 15 April 2021, at 10:58:31 (-0300), Heitor wrote:
> I'm trying to setup NHC[0] for our Slurm cluster, but I'm not
> getting it to work properly.

Just for future reference, NHC has its own mailing lists, and even though your question does relate to Slurm tangentially, it's really an N

[slurm-users] Potential Slurm User Group 2021 Survey

2021-04-20 Thread Jacob Jenson
SchedMD is currently planning the annual Slurm User Group Meeting. We would like to hold the meeting in person in the Salt Lake City, Utah, USA area. Could you please take a moment to fill out the three-question survey, linked below, to help us know whether you would be able to attend an in-person Slurm User

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-20 Thread Robert Peck
Submission did succeed when I put the -K0 inside the srun command within my job script though. Will be a while before my job runs though, so won't know for a little while whether the KillOnBadExit flag has helped.

On Tue, 20 Apr 2021 at 16:57, Robert Peck wrote:
> P.S. the slurm version here is

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-20 Thread Robert Peck
P.S. the slurm version here is 20.02.3

On Tue, 20 Apr 2021 at 16:55, Robert Peck wrote:
> Chris: thanks for that tip, I'm having a look at that now, it sounds
> promising.
>
> Run on the login node I get:
> scontrol show config | fgrep KillOnBadExit
> KillOnBadExit = 0
>
> I've tried t

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-20 Thread Robert Peck
Chris: thanks for that tip, I'm having a look at that now, it sounds promising.

Run on the login node I get:
scontrol show config | fgrep KillOnBadExit
KillOnBadExit = 0

I've tried to put -K0 into a job to see if that helps. But doing it on the command line sbatch -K0 job_name.job giv
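
For readers following this thread, the configuration check quoted above can be wrapped in a small helper. This is only a sketch and is not taken from the thread: the function name `slurm_config_value` and the program name `./my_task` are hypothetical, and it assumes the `Key = Value` layout that `scontrol show config` printed in the message above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): read `scontrol show config`
# output on stdin and print the value of a single parameter.
# Assumes each setting is printed as "Key = Value" on its own line.
slurm_config_value() {
    awk -v key="$1" '
        {
            pos = index($0, "=")                 # first "=" splits key from value
            if (pos == 0) next                   # skip lines without a setting
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/[ \t]/, "", k)                 # strip padding around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim the value
            if (k == key) { print v; exit }
        }'
}

# On a cluster this would be used as:
#   scontrol show config | slurm_config_value KillOnBadExit
#
# As reported later in this thread, the per-step override was accepted
# when placed on srun inside the job script, for example:
#   srun -K0 ./my_task     # ./my_task is a placeholder program name
```

The helper only reads text from standard input, so it can be tried on saved `scontrol show config` output without access to a cluster.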