Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-22 Thread Marcus Boden
You can also use UnkillableStepProgram to debug things:

> UnkillableStepProgram
> If the processes in a job step are determined to be unkillable for a period of time specified by the UnkillableStepTimeout variable, the program specified by UnkillableStepProgram will be executed. This …
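For anyone wanting to try that, a minimal sketch of the wiring (the script path and its contents are my own illustration, not from this thread). In slurm.conf:

    # run a diagnostic script when a step refuses to die within UnkillableStepTimeout
    UnkillableStepProgram=/usr/local/sbin/unkillable_debug.sh

And a bare-bones script behind it might just snapshot what is stuck on the node:

    #!/bin/bash
    # unkillable_debug.sh (hypothetical) -- record node state for later inspection
    {
        date
        ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /D/'   # tasks in uninterruptible sleep
        cat /proc/mounts                                          # look for hung network filesystems
    } >> "/var/log/slurm/unkillable.$(hostname).log" 2>&1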

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-22 Thread Paul Edmon
It can also happen if you have a stalled-out filesystem or stuck processes.  I've gotten in the habit of doing a daily patrol for them to clean them up.  Most of the time you can just reopen the node, but sometimes this indicates something is wedged. -Paul Edmon- On 10/22/2019 5:22 PM, Riebs, …
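A rough sketch of such a patrol (the reason string and node name come from this thread; the commands are only an illustration and assume you have already checked that the node is healthy):

    # list nodes drained with this reason
    sinfo -R | grep -i 'kill task failed'

    # reopen a node once it looks sane again
    scontrol update NodeName=server15 State=RESUME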

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-22 Thread Riebs, Andy
A common reason for seeing this is if a process is dropping core -- the kernel will ignore job kill requests until that is complete, so the job isn't being killed as quickly as Slurm would like. I typically recommend increasing the UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to avoid …
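If UnkillableStepTimeout is the knob meant here (its default is 60 seconds), the change is a one-liner in slurm.conf:

    # give the kernel time to finish writing the core dump before Slurm
    # declares the step unkillable and drains the node
    UnkillableStepTimeout=180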

[slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-22 Thread Will Dennis
Hi all, I have a number of nodes on one of my 17.11.7 clusters in drain mode on account of reason "Kill task failed". I see the following in slurmd.log:

[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T2…
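For anyone hitting the same thing, a few commands that may help confirm the reason and spot the pattern (the slurmd log path varies by site):

    sinfo -R                                          # which nodes are drained, and why
    scontrol show node server15 | grep -i reason      # the exact reason and when it was set
    grep -i 'kill task failed' /var/log/slurm/slurmd.log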

Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-22 Thread Juergen Salk
Dear Chris, I could not find this warning in the slurm.conf man page. So I googled it and found a reference in the Slurm developers documentation: https://slurm.schedmd.com/jobacct_gatherplugins.html However, this web page says in its footer: "Last modified 27 March 2015". So maybe (means: hopefully) …

[slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-22 Thread Christopher Benjamin Coffey
Hi, We've been using jobacct_gather/cgroup for quite some time and haven't had any issues (I think). We do see some lengthy job cleanup times when there are lots of small jobs completing at once; maybe that is due to the cgroup plugin. At SLUG19 a Slurm dev presented information that the jobacct…
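For reference, the choice between the two plugins is a single slurm.conf setting (restarting the daemons after the change is assumed; the frequency value below is just an example):

    # pick exactly one accounting-gather plugin
    JobAcctGatherType=jobacct_gather/linux
    #JobAcctGatherType=jobacct_gather/cgroup
    JobAcctGatherFrequency=task=30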

Re: [slurm-users] Interlocking / Concurrent runner

2019-10-22 Thread Reuti
Hi, I'm more used to GridEngine but use SLURM at remote locations. What I miss in SLURM are two features which I think are related to this area:
• use job names with wildcards for the various commands like `scancel`
• use job names with wildcards for --dependency
If I get you right, you would like …
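Slurm does not do wildcards for either of these, but exact names work with scancel, and squeue can fake the rest. A rough sketch (the job-name prefix and script name below are made up for illustration):

    # cancel by exact job name
    scancel --name=nightly_rebuild

    # wildcard-ish matching: resolve names to job ids with squeue, then cancel them
    squeue -h -u "$USER" -o '%i %j' | awk '$2 ~ /^nightly_/ {print $1}' | xargs -r scancel

    # --dependency wants job ids, so the same trick can feed afterok
    ids=$(squeue -h -u "$USER" -o '%i %j' | awk '$2 ~ /^nightly_/ {print $1}' | paste -sd: -)
    sbatch --dependency=afterok:$ids next_step.sh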

Re: [slurm-users] Interlocking / Concurrent runner

2019-10-22 Thread Florian Lohoff
On Tue, Oct 22, 2019 at 12:06:57PM +0300, mercan wrote:
> Hi;
>
> You can use the "--dependency=afterok:jobid:jobid ..." parameter of sbatch to ensure the newly submitted job will wait until all older jobs are finished. Simply, you can submit the new job even while older jobs are running …

Re: [slurm-users] Interlocking / Concurrent runner

2019-10-22 Thread mercan
Hi; You can use the "--dependency=afterok:jobid:jobid ..." parameter of sbatch to ensure the newly submitted job will wait until all older jobs are finished. Simply, you can submit the new job even while the older jobs are running; the new job will not start before the old jobs have finished.
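A small worked example of that pattern (the script names are placeholders):

    # --parsable makes sbatch print just the job id, which is handy for chaining
    jobA=$(sbatch --parsable daily_import.sh)
    jobB=$(sbatch --parsable daily_cleanup.sh)
    sbatch --dependency=afterok:${jobA}:${jobB} daily_report.sh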

[slurm-users] Interlocking / Concurrent runner

2019-10-22 Thread Florian Lohoff
Hi, I am using Slurm in a single-node job batching system. Slurm is perfect for that case and has worked flawlessly for a couple of years. Lately I was shuffling around jobs, so that jobs which take much longer to run only run daily, and other jobs run more frequently. A question I had was: is there a possibility …
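One standard Slurm feature that fits this interlocking use case, offered only as a pointer alongside the afterok answers above: jobs that share a job name and are submitted with --dependency=singleton run one at a time, which keeps a recurring job from overlapping with its previous run. Sketch with placeholder names:

    sbatch --job-name=osm_daily --dependency=singleton daily_pipeline.sh
    # a second submission with the same name waits until the first has finished
    sbatch --job-name=osm_daily --dependency=singleton daily_pipeline.sh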