Hi Patrick,
On 9/22/25 07:39, Patrick Begou via slurm-users wrote:
I have also seen a node reach this "drain" state twice in the last few weeks. It
is the first time on this cluster (Slurm 24.05 on the latest setup), and
I have been running Slurm for many years (Slurm 20.11 on the oldest cluster).
No user process was found, so I just resumed the node.
This may happen when the job's I/O takes too long and the
UnkillableStepTimeout gets exceeded, but later on the I/O actually
completes and the user's processes ultimately exit.
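For reference, this timeout is the UnkillableStepTimeout parameter in
slurm.conf. A minimal sketch of raising it (the value of 180 seconds is
purely illustrative, not a recommendation):

UnkillableStepTimeout=180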
It is informative to ask Slurm for any events on the affected nodes by
using the sacctmgr command:
$ sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=<nodenames>
This will show you why the node became "drained".
By default, events are shown starting from 00:00:00 of the previous day,
but this can be changed with the Start= option.
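For example, to only list events since the start of the month (node name
and date here are just placeholders):

$ sacctmgr show event Format=NodeName,TimeStart,Duration,State,Reason,User where nodes=node001 start=2025-09-01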
Best regards,
Ole
On 19/09/2025 at 20:07, Ole Holm Nielsen via slurm-users wrote:
On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
Is there any way to reset a node to “state=idle” after errors in the
same way?
First you have to investigate whether the job's user has any processes
left behind on the compute node. It may very well be stale I/O from
the job to a network file server.
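One way to check, assuming you can SSH to the node (node and user names
are placeholders):

$ squeue -w node001
$ ssh node001 ps -fu username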
It may also happen that the I/O actually completed *after* Slurm
drained the node and all user processes have since exited. In this case
you may simply "resume" the node xxx:
$ scontrol update nodename=xxx state=resume
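It can also be useful to double-check the recorded state and reason
first, for example:

$ sinfo -R --nodes=xxx
$ scontrol show node xxx | grep -i -E 'state|reason'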
However, if stale user processes continue to exist, your only choice is
to reboot the node and tell Slurm to resume node xxx:
$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx
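If the reboot turns out not to be needed after all, the pending reboot
can be cancelled again with:

$ scontrol cancel_reboot xxx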
We just now had a "Kill task failed" event on a node which caused it to
drain, and Slurm Triggers then sent an E-mail alert to the sysadmin.
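If you would like a similar alert, a permanent trigger on drained nodes
can be set up roughly like this (the script path is just a placeholder
for your own notification script):

$ strigger --set --node --drained --program=/usr/local/sbin/notify_slurm_admin --flags=PERM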
Logging in to the node I found a user process left behind after the
Slurm job had been killed:
$ ps auxw | sed /root/d
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
username 29160 97.4 1.3 13770416 10415916 ? D Sep17 2926:25 /home/username/...
As you can see, the process state is "D". According to the "ps" manual,
D means "uninterruptible sleep (usually IO)".
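One quick way to list such stuck processes (just one possible
incantation):

$ ps -eo user,pid,stat,wchan:32,etime,cmd | awk 'NR==1 || $3 ~ /^D/'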
In this case the only possible fix is to reboot the node, thereby
forcibly terminating the frozen I/O on the network file server.
IHTH,
Ole