On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
> Is there any way to reset node to “state=idle” after errors in the
> same way?
First you have to investigate whether the job's user has any processes
left behind on the compute node. It may very well be stale I/O from
the job to a network file server.
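A quick check on the node is to list all of that user's processes, for
example:
$ ps -fu username
where "username" is the owner of the job that drained the node.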
It may also happen that the I/O actually completed *after* Slurm
drained the node, and no user processes remain. In this case you can
simply resume node xxx:
$ scontrol update nodename=xxx state=resume
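Afterwards you can verify that the node is back in production with,
for example:
$ sinfo --nodes=xxx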
However, if stale user processes persist, your only choice is to
reboot the node and tell Slurm to resume node xxx:
$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx
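With "asap" the node is drained and rebooted as soon as the last
running job has finished. In the meantime you can inspect the node,
whose state should indicate the pending reboot (the exact flag differs
between Slurm versions):
$ scontrol show node xxx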
We just now had a "Kill task failed" event on a node which caused it to
drain, and Slurm Triggers then sent an E-mail alert to the sysadmin.
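As an aside, such a drained-node alert can be set up with strigger; a
minimal sketch, assuming a notification script of your own at
/usr/local/bin/notify_nodes_drained:
$ strigger --set --node --drained --flags=PERM --program=/usr/local/bin/notify_nodes_drained
The name(s) of the drained node(s) are passed to the script as an
argument.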
Logging in to the node, I found a user process left behind after the
Slurm job had been killed:
$ ps auxw | sed /root/d
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
username 29160 97.4 1.3 13770416 10415916 ? D Sep17 2926:25
/home/username/...
As you can see, the process state is "D". According to the "ps" manual
D means "uninterruptible sleep (usually IO)".
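If you want to see where such a process is stuck, you can look at its
kernel wait channel, or (as root, if your kernel exposes it) at its
kernel stack. With the PID from above this would be roughly:
$ ps -o pid,stat,wchan:32,cmd -p 29160
$ cat /proc/29160/stack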
In this case the only possible fix is to reboot the node, thereby
forcibly terminating the I/O that is frozen against the network file
server.
IHTH,
Ole