On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
> Is there any way to reset node to “state=idle” after errors in the
> same way?
First you have to investigate whether the job's user has any processes
left behind on the compute node. It may very well be stale I/O from
the job to a network file server.
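A quick check on the node is to list all of that user's processes, for
example:
$ ps -fu username
where "username" is the owner of the job that drained the node.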
It may also happen that the I/O actually completed *after* Slurm
drained the node, and no user processes remain. In this case you can
simply resume node xxx:
$ scontrol update nodename=xxx state=resume
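Afterwards you can verify that the node is back in production with,
for example:
$ sinfo --nodes=xxx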
However, if stale user processes persist, your only choice is to
reboot the node and tell Slurm to resume node xxx:
$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx
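With "asap" the node is drained and rebooted as soon as the last
running job has finished. In the meantime you can inspect the node,
whose state should indicate the pending reboot (the exact flag differs
between Slurm versions):
$ scontrol show node xxx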
We just now had a "Kill task failed" event on a node which caused it to
drain, and Slurm Triggers then sent an E-mail alert to the sysadmin.
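As an aside, such a drained-node alert can be set up with strigger; a
minimal sketch, assuming a notification script of your own at
/usr/local/bin/notify_nodes_drained:
$ strigger --set --node --drained --flags=PERM --program=/usr/local/bin/notify_nodes_drained
The name(s) of the drained node(s) are passed to the script as an
argument.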
Logging in to the node, I found a user process left behind after the
Slurm job had been killed:
$ ps auxw | sed /root/d
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
username 29160 97.4 1.3 13770416 10415916 ? D Sep17 2926:25
/home/username/...
As you can see, the process state is "D". According to the "ps" manual
D means "uninterruptible sleep (usually IO)".
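If you want to see where such a process is stuck, you can look at its
kernel wait channel, or (as root, if your kernel exposes it) at its
kernel stack. With the PID from above this would be roughly:
$ ps -o pid,stat,wchan:32,cmd -p 29160
$ cat /proc/29160/stack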
In this case the only possible fix is to reboot the node, thereby
forcibly terminating the I/O that is frozen against the network file
server.
IHTH,
Ole