On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
> Is there any way to reset node to “state=idle” after errors in the same way?

First you have to investigate whether the job's user has any processes left behind on the compute node.  The cause may very well be stale I/O from the job to a network file server.
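For example (a quick sketch, assuming the drained node is xxx and the job's user is "username"), you can check why the node was drained and whether that user still has processes running on it:

$ scontrol show node xxx | grep -i reason
$ ssh xxx pgrep -u username -l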

It may also happen that the I/O actually completed *after* Slurm drained the node, and all user processes have finished.  In this case you may simply "resume" node xxx:

$ scontrol update nodename=xxx state=resume

However, if stale user processes remain, your only choice is to reboot the node and tell Slurm to resume node xxx:

$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx

We just now had a "Kill task failed" event on a node which caused it to drain, and Slurm Triggers then sent an E-mail alert to the sysadmin.
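Such an alert can be set up as a permanent trigger registered by the SlurmUser; a minimal sketch, assuming a hypothetical notify script at /usr/local/bin/notify_node_drained:

$ strigger --set --node --drained --flags=PERM --program=/usr/local/bin/notify_node_drained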

Logging in to the node, I found a user process left behind after the Slurm job had been killed:

$ ps auxw | sed /root/d    # filter out root-owned processes
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
username 29160 97.4 1.3 13770416 10415916 ? D Sep17 2926:25 /home/username/...

As you can see, the process state is "D".  According to the "ps" manual, "D" means "uninterruptible sleep (usually IO)".
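To see what the process is blocked on (just an illustrative check, using the PID from the output above), you can print the kernel wait channel:

$ ps -o pid,stat,wchan:32,cmd -p 29160

A wait channel inside NFS or other network-filesystem code confirms that the process is stuck in I/O which cannot be interrupted.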

In this case the only possible fix is to reboot the node, thereby forcibly terminating the frozen I/O to the network file server.

IHTH,
Ole
