[slurm-users] Re: Node switching randomly to down state

Davide DelVento via slurm-users Sat, 18 Oct 2025 05:47:32 -0700

As the great Ole just taught us in another thread, this should tell you why:


sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]

However, I suspect you'd only get "not responding" again ;-)

Are you sure that all the slurm services are running correctly on
those servers? Maybe try rebooting them?
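
A minimal sketch of such a check, assuming passwordless SSH to the nodes and that slurmd and munge run as systemd units named "slurmd" and "munge" (neither assumption comes from this thread). The NODES and RUN variables and the check_nodes function are hypothetical names used for illustration only. By default the script only prints the commands it would run:

```shell
#!/bin/sh
# Sketch only: basic health checks for nodes that slurmctld reports as
# "not responding".
# Assumptions (not from the thread): passwordless SSH to the nodes, and
# slurmd and munge running as systemd units named "slurmd" and "munge".

NODES="${NODES:-FX12 FX13 FX14}"   # nodes shown as down* in sinfo
RUN="${RUN:-echo}"                 # dry run by default; set RUN=env to execute

check_nodes() {
    for node in $NODES; do
        # Are the daemons up on the compute node?
        $RUN ssh "$node" systemctl is-active slurmd munge
        # Can the compute node reach the controller?
        $RUN ssh "$node" scontrol ping
        # Clock on the node; munge rejects credentials if clocks drift too far.
        $RUN ssh "$node" date
        # The controller's view of the node, including its Reason field.
        $RUN scontrol show node "$node"
    done
}

check_nodes
```

Running it with RUN=env executes the commands instead of printing them. If slurmd is inactive on a node, or the node cannot reach the controller, that would be consistent with the "Not responding" reason shown by sinfo -R.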




On Tue, Sep 23, 2025 at 12:15 PM Julien Tailleur via slurm-users <
[email protected]> wrote:

> Dear all,
>
> I maintain a small computing cluster that shows a weird behavior
> I have been unable to debug.
>
> My cluster comprises one master node and 16 computing servers, organized
> in two queues, each queue having 8 servers. All servers run up-to-date
> Debian bullseye. All but 3 servers work flawlessly.
>
> From the master node, I can see that 3 servers in one of the queues
> appear down:
>
> jtailleu@kandinsky:~$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> Volume*      up   infinite      8  alloc FX[21-24,31-34]
> Speed        up   infinite      3  down* FX[12-14]
> Speed        up   infinite      4  alloc FX[41-44]
> Speed        up   infinite      1   idle FX11
>
> These servers are reachable by SSH and ping:
>
> jtailleu@kandinsky:~$ ping -c 1 FX12
> PING FX12 (192.168.6.22) 56(84) bytes of data.
> 64 bytes from FX12 (192.168.6.22): icmp_seq=1 ttl=64 time=0.070 ms
>
> --- FX12 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.070/0.070/0.070/0.000 ms
>
> #####
>
> I can also put these nodes back into idle mode:
>
> root@kandinsky:~# scontrol update nodename=FX[12-14] state=idle
> root@kandinsky:~# sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> Volume*      up   infinite      8  alloc FX[21-24,31-34]
> Speed        up   infinite      3  idle* FX[12-14]
> Speed        up   infinite      4  alloc FX[41-44]
> Speed        up   infinite      1   idle FX11
>
> But then they switch back to the down state a few minutes later:
>
> root@kandinsky:~# sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> Volume*      up   infinite      8  alloc FX[21-24,31-34]
> Speed        up   infinite      3  down* FX[12-14]
> Speed        up   infinite      4  alloc FX[41-44]
> Speed        up   infinite      1   idle FX11
>
> root@kandinsky:~# sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Not responding       slurm     2025-09-08T15:04:39 FX[12-14]
>
> I do not understand where the "not responding" comes from, nor how I can
> investigate that. Any idea what could trigger this behavior?
>
> Best wishes,
>
> Julien
>
>
>
>
> --
> slurm-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>
