You can also use the UnkillableStepProgram to debug things:
> UnkillableStepProgram
> If the processes in a job step are determined to be unkillable for a
> period of time specified by the UnkillableStepTimeout variable, the program
> specified by UnkillableStepProgram will be executed. This
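As a rough sketch (the script path and its contents below are just an assumption, not anything from the docs), the relevant slurm.conf settings and a trivial debug helper might look like:

    # slurm.conf (compute nodes)
    UnkillableStepTimeout=120
    UnkillableStepProgram=/usr/local/sbin/unkillable_debug.sh

    #!/bin/bash
    # /usr/local/sbin/unkillable_debug.sh (hypothetical helper)
    # Record processes stuck in uninterruptible sleep (D state) so you can
    # see what was wedged after the fact.
    {
        echo "=== $(date) on $(hostname) ==="
        ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
    } >> /var/log/slurm/unkillable_debug.log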
It can also happen if you have a stalled-out filesystem or stuck
processes. I've gotten in the habit of doing a daily patrol to
clean them up. Most of the time you can just reopen the node, but
sometimes this indicates something is wedged.
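(By "reopen" I just mean returning the node to service once you're satisfied nothing is actually stuck, e.g.:

    scontrol update NodeName=<nodename> State=RESUME

with <nodename> replaced by the drained node.)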
-Paul Edmon-
On 10/22/2019 5:22 PM, Riebs,
A common reason for seeing this is a process dropping core -- the kernel
will ignore job kill requests until that is complete, so the job isn't being
killed as quickly as Slurm would like. I typically recommend increasing
UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to avo
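If you want to try that, it is a one-line change in slurm.conf (the value is in seconds), for example:

    UnkillableStepTimeout=180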
Hi all,
I have a number of nodes on one of my 17.11.7 clusters in drain mode on account
of reason: "Kill task failed"
I see the following in slurmd.log —
[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15
CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T2
Dear Chris,
I could not find this warning in the slurm.conf man page. So I googled
it and found a reference in the Slurm developers documentation:
https://slurm.schedmd.com/jobacct_gatherplugins.html
However, this web page says in its footer: "Last modified 27 March 2015".
So maybe (means: hop
Hi,
We've been using jobacct_gather/cgroup for quite some time and haven't had any
issues (I think). We do see some lengthy job cleanup times when there are lots
of small jobs completing at once; maybe that is due to the cgroup plugin. At
SLUG19 a Slurm dev presented information that the jobacc
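For reference, this is roughly the configuration being discussed (slurm.conf; the matching cgroup.conf settings will vary by site):

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    JobAcctGatherType=jobacct_gather/cgroup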
Hi,
I'm more used to GridEngine but use SLURM at remote locations. What I miss in
SLURM are two features which I think are related to this area:
• use job names with wildcards for the various commands like `scancel` (a rough workaround is sketched below)
• use job names with wildcards for --dependency
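One way to approximate the first point today is a small shell loop over squeue output, since `scancel --name` only matches an exact name; the "nightly_" job-name pattern here is just an example:

    # Cancel all of my jobs whose name starts with "nightly_" (example pattern).
    for jid in $(squeue -h -u "$USER" -o "%i %j" | awk '$2 ~ /^nightly_/ {print $1}'); do
        scancel "$jid"
    done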
If I get you right, you would l
On Tue, Oct 22, 2019 at 12:06:57PM +0300, mercan wrote:
> Hi;
>
> You can use the "--dependency=afterok:jobid:jobid ..." parameter of
> sbatch to ensure the newly submitted job will wait until all older jobs
> have finished. Simply put, you can submit the new job even while the older jobs are
> run
Hi;
You can use the "--dependency=afterok:jobid:jobid ..." parameter of
sbatch to ensure the newly submitted job will wait until all older
jobs have finished. Simply put, you can submit the new job even while the older
jobs are running; the new job will not start before the old jobs have finished.
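For example (the script names are placeholders):

    jid1=$(sbatch --parsable first_job.sh)
    jid2=$(sbatch --parsable second_job.sh)
    sbatch --dependency=afterok:${jid1}:${jid2} new_job.sh

With --parsable, sbatch prints just the job ID, which makes it easy to build the dependency string.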
Hi,
I am using Slurm in a single-node job batching system. Slurm is perfect
for that use case and has worked flawlessly for a couple of years. Lately I have
been shuffling jobs around so that jobs which take much longer to run only
run daily, and other jobs run more frequently.
A question I had was: is there a possib
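If the question is about running a job on a fixed schedule, one common sketch (which may or may not be what is being asked here) is to have the batch script resubmit itself with a delayed start:

    #!/bin/bash
    #SBATCH --job-name=daily_task
    # ... the actual daily work goes here ...
    # Resubmit this script to start again roughly one day from now.
    # (The path is a placeholder; see "man sbatch" for the exact --begin time syntax.)
    sbatch --begin=now+1day /path/to/daily_task.sh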