Dear all,

I maintain a small computing cluster and I am seeing a weird behavior that I have failed to debug.

My cluster comprises one master node and 16 compute servers, organized into two queues of 8 servers each. All servers run up-to-date Debian bullseye, and all but 3 of them work flawlessly.
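In case it is relevant: the two queues are plain Slurm partitions, and the corresponding slurm.conf entries look roughly like the lines below (simplified from memory, hardware details left out):

NodeName=FX[21-24,31-34] State=UNKNOWN
NodeName=FX[11-14,41-44] State=UNKNOWN
PartitionName=Volume Nodes=FX[21-24,31-34] Default=YES MaxTime=INFINITE State=UP
PartitionName=Speed  Nodes=FX[11-14,41-44] MaxTime=INFINITE State=UP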

From the master node, I can see that 3 servers in one of the queues appear as down:

jtailleu@kandinsky:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  down* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

These servers are reachable by SSH and respond to ping:

jtailleu@kandinsky:~$ ping -c 1 FX12
PING FX12 (192.168.6.22) 56(84) bytes of data.
64 bytes from FX12 (192.168.6.22): icmp_seq=1 ttl=64 time=0.070 ms

--- FX12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.070/0.070/0.070/0.000 ms
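Beyond ping and SSH I was not sure what else to verify; I assume the relevant checks would be something along these lines (listing only the commands, since I am not sure yet what a healthy output should look like):

# on an affected node: is slurmd running, and what does it report about the hardware?
systemctl status slurmd
slurmd -C
# from the master: the full node record, including the Reason field
scontrol show node FX12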


I can also put these nodes back into the idle state:

root@kandinsky:~# scontrol update nodename=FX[12-14] state=idle
root@kandinsky:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  idle* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

But then they switch back to the down state a few minutes later:

root@kandinsky:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  down* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

root@kandinsky:~# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2025-09-08T15:04:39 FX[12-14]
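The only other place I can think of looking is the daemon logs, roughly along these lines (the log locations below are just my guess at the Debian defaults and may well differ):

# on the master: what slurmctld says about these nodes
grep -i 'FX1[2-4]' /var/log/slurm/slurmctld.log
# on an affected node: the slurmd side (works even if slurmd only logs to syslog)
journalctl -u slurmd --since "1 hour ago"

but I am not sure what I should be looking for there.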

I do not understand where the "Not responding" comes from, nor how to investigate it further. Does anyone have an idea what could trigger this behavior?

Best wishes,

Julien
