On Tuesday, 28 May 2019 9:03:16 AM PDT Matthew BETTINGER wrote:
> We use triggers for the obvious alerts, but is there a way to make a trigger
> for nodes stuck in the CG (completing) state? Some user jobs, mostly Julia
> notebooks, can hang in the completing state if the user kills the running job
> or
OK, thanks, we will look into that! We thought we were the only ones with this
problem, and yes, it's like Windows 98SE: you can try all you want, but
eventually we end up rebooting the nodes. Interns are starting to show up, and
you know they can bend a cluster in ways you have never seen before. We wi
Hi,
Check the UnkillableStepProgram and UnkillableStepTimeout options in
slurm.conf.
We use it to drain the stuck nodes and mail us; as in your case, stuck
processes usually require a reboot. As the drained strigger will never get
triggered, we also set a finished trigger for the next RUNNING job. T
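
A minimal sketch of what such an UnkillableStepProgram could look like, written
in Python. This is not the script used by the poster above; it only illustrates
the "drain the node and mail us" idea. The admin address, the sendmail path, the
assumption that the Slurm NodeName matches the short hostname, and the use of
the SLURM_JOB_ID environment variable are all assumptions to check against your
own site and Slurm version.

```python
#!/usr/bin/env python3
"""Hypothetical UnkillableStepProgram: drain this node and mail the admins."""
import os
import socket
import subprocess
from email.message import EmailMessage

ADMIN_ADDR = "root@localhost"    # assumption: replace with your admin address
SENDMAIL = "/usr/sbin/sendmail"  # assumption: a local MTA is installed here


def build_drain_command(node, reason):
    """Return the scontrol invocation that drains `node` with `reason`."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def build_mail(node, job_id, to_addr=ADMIN_ADDR):
    """Build a short notification mail about the stuck job step."""
    msg = EmailMessage()
    msg["Subject"] = f"Unkillable step on {node} (job {job_id})"
    msg["From"] = f"slurm@{node}"
    msg["To"] = to_addr
    msg.set_content(
        f"Node {node} has been drained because a step of job {job_id} "
        "could not be killed. The node probably needs a reboot.\n"
    )
    return msg


def main():
    # Assumption: the Slurm NodeName equals the short hostname.
    node = socket.gethostname().split(".")[0]
    # Assumption: slurmd exports the job ID under this name.
    job_id = os.environ.get("SLURM_JOB_ID", "unknown")
    reason = f"unkillable step (job {job_id})"
    subprocess.run(build_drain_command(node, reason), check=False)
    msg = build_mail(node, job_id)
    subprocess.run([SENDMAIL, "-t"], input=msg.as_bytes(), check=False)


# Only act when a job ID is present in the environment, so that running
# the file by hand on a login node does nothing.
if __name__ == "__main__" and "SLURM_JOB_ID" in os.environ:
    main()
```

The script would be referenced from slurm.conf through UnkillableStepProgram,
with UnkillableStepTimeout controlling how long slurmd waits before running it.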
Hi;
If you are not already using an epilog script, you can configure one to
clean up any leftovers from finished jobs:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
Ahmet M.
On 28.05.2019 at 19:03, Matthew BETTINGER wrote:
We use triggers for the obvious alerts, but is there a way to make a trigger for
nodes stuck in the CG (completing) state? Some user jobs, mostly Julia notebooks,
can hang in the completing state if the user kills the running job or cancels
it with Ctrl. When this happens we can have many many nodes
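
One possible workaround, sketched here in Python, is to poll sinfo from cron
rather than rely on a trigger. This is only an illustration under assumptions,
not a built-in Slurm feature: the function names are hypothetical, and treating
"completing in two consecutive polls" as "stuck" is a heuristic whose interval
you would need to pick yourself.

```python
import subprocess


def parse_node_list(sinfo_output):
    """Return unique node names, in order, from one-name-per-line output."""
    nodes = []
    for line in sinfo_output.splitlines():
        name = line.strip()
        # With -N a node is listed once per partition, so skip repeats.
        if name and name not in nodes:
            nodes.append(name)
    return nodes


def completing_nodes():
    """Ask sinfo for the nodes currently in the completing state."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-t", "completing", "-o", "%N"],
        capture_output=True, text=True, check=True,
    )
    return parse_node_list(result.stdout)


def still_completing(previous, current):
    """Nodes that were completing in the previous poll and still are now."""
    return [node for node in current if node in previous]
```

A cron job could store the result of completing_nodes() between runs and alert
on whatever still_completing() returns, for example by mailing the list or
draining those nodes.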