On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:

> i was wondering why a node is drained when killing of task fails and how can
> i disable it? (i use cgroups) moreover, how can the killing of task fails?
> (this is on slurm 19.05)

Slurm has tried to kill the processes, but they refuse to go away. Usually this means they're stuck in a device or I/O wait for some reason, so look for processes that are in a "D" state (uninterruptible sleep) on the node.

As others have said, they can be stuck writing out large files and waiting for the kernel to complete that before they exit. This can also happen if you're using GPUs and something has gone wrong in the driver, leaving the process stuck in the kernel somewhere.

You can try running the following as root on the node to see whether the kernel reports stuck tasks and where they are stuck:

    echo w > /proc/sysrq-trigger

If there are tasks stuck in that state, then often the only recourse is to reboot the node back into health.

You can tell Slurm to run a program on the node should it find itself in this state, see:

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram

Best of luck,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
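
For looking for "D" state processes as suggested above, here is a minimal sketch using ps and awk. The function name find_d_state is made up for illustration, and the column layout (pid, stat, wchan, comm) assumes the procps version of ps found on most Linux nodes. The wchan column shows the kernel function the task is sleeping in, which may hint at whether it is waiting on storage or on a driver.

```shell
# Hypothetical helper (not part of Slurm): filter "ps" output read from stdin,
# keeping the header row and any row whose STAT column starts with "D"
# (uninterruptible sleep).
find_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Assumes procps ps: print PID, state, kernel wait channel and command name
# for every process, then keep only the D-state ones.
ps -eo pid,stat,wchan:32,comm | find_d_state
```

A script along these lines could plausibly be what UnkillableStepProgram points at, so that the stuck tasks get recorded before anyone reboots the node, but that is a suggestion rather than anything the Slurm documentation prescribes.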