On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
[root@login-node ~]# sinfo
PARTITION  TIMELIMIT  AVAIL  STATE    NODELIST  CPU_LOAD  NODES(A/I)  NODES(A/I/O/T)  CPUS  CPUS(A/I/O/T)  REASON
node.q*    4:00:00    up     drained  clus09    0.00      0/0         0/0/1/1         12    0/0/12/12     Kill task failed
The *Kill task failed* reason is due to the UnkillableStepTimeout [1]
configuration:
The length of time, in seconds, that Slurm will wait before deciding that
processes in a job step are unkillable (after they have been signaled with
SIGKILL) and execute UnkillableStepProgram. The default timeout value is 60
seconds or five times the value of MessageTimeout, whichever is greater. If
exceeded, the compute node will be drained to prevent future jobs from being
scheduled on the node.
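If such drains happen regularly because of slow network I/O, the timeout can be raised and an UnkillableStepProgram used to collect diagnostics. A purely illustrative sketch for slurm.conf (both parameters exist, but the value and the script path here are assumptions, not recommendations):

# slurm.conf (illustrative values)
UnkillableStepTimeout=180
UnkillableStepProgram=/usr/local/sbin/unkillable_step_report.sh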
But it seems there is no error on the node... The slurmctld.log on the server
seems correct, too.
The slurmctld won't show any errors. The node exceeded the
UnkillableStepTimeout, and Slurm has therefore drained it.
Is there any way to reset the node to “state=idle” after errors like these?
First you have to investigate whether the job's user has any processes left
behind on the compute node. It may very well be stale I/O from the job to
a network file server.
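For example, assuming the job belonged to user "jdoe" (an illustrative name) and the drained node is clus09 as in the sinfo output above, you could look for leftover processes with:

$ ssh clus09 ps -u jdoe -o pid,stat,wchan:20,cmd

Processes stuck in uninterruptible I/O wait show state "D" and usually cannot be killed until the I/O completes.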
It may also happen that the I/O actually completed *after* Slurm
drained the node, and all user processes have since exited. In this case you
may simply "resume" the node xxx:
$ scontrol update nodename=xxx state=resume
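Afterwards you can check that the node has left the drained state, e.g.:

$ sinfo -n xxx -o "%N %T %E"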
However, if stale user processes continue to exist, your only choice is to
reboot the node and tell Slurm to resume node xxx:
$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx
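The "asap" keyword drains the node so no new jobs start, and the reboot happens as soon as the currently running jobs have finished. Until then you can watch the node's State and Reason fields, e.g.:

$ scontrol show node xxx | grep -Ei 'state|reason'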
IHTH,
Ole
[1] https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout