First place I'd look would be confirming connectivity on the Slurm-related ports (e.g. a firewall issue). In my experience that's especially likely when a node works for a little while and then stops responding after some period of time.
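Something like the following is what I'd try first (a rough sketch only; 6817/6818 are just the default SlurmctldPort/SlurmdPort values, so confirm them against your own slurm.conf before reading too much into the result):

# On the controller, see which ports your config actually uses:
scontrol show config | grep -Ei 'slurmctldport|slurmdport'

# From the controller, check that slurmd on one of the "down" nodes is reachable:
nc -zv FX12 6818

# From the affected node, check the return path to slurmctld on the controller:
nc -zv kandinsky 6817

If the node pings fine but the connection to the slurmd port hangs or is refused, that points at a firewall rule or a dead slurmd rather than a general network problem.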
The logs may also tell you what's going on.

On Sep 23, 2025, at 14:13, Julien Tailleur via slurm-users <[email protected]> wrote:

Dear all,

I am maintaining a small computing cluster and I am seeing odd behavior that I have failed to debug. My cluster comprises one master node and 16 compute servers, organized in two queues of 8 servers each. All servers run up-to-date Debian bullseye.

All but 3 servers work flawlessly. From the master node, I can see that 3 servers in one of the queues appear down:

jtailleu@kandinsky:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  down* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

These servers are reachable by SSH/ping:

jtailleu@kandinsky:~$ ping -c 1 FX12
PING FX12 (192.168.6.22) 56(84) bytes of data.
64 bytes from FX12 (192.168.6.22): icmp_seq=1 ttl=64 time=0.070 ms

--- FX12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.070/0.070/0.070/0.000 ms

I can also put these nodes back into idle mode:

root@kandinsky:~# scontrol update nodename=FX[12-14] state=idle
root@kandinsky:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  idle* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

But then they switch back into the down state a few minutes later:

root@kandinsky:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Volume*      up   infinite      8  alloc FX[21-24,31-34]
Speed        up   infinite      3  down* FX[12-14]
Speed        up   infinite      4  alloc FX[41-44]
Speed        up   infinite      1   idle FX11

root@kandinsky:~# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2025-09-08T15:04:39 FX[12-14]

I do not understand where the "Not responding" comes from, nor how I can investigate it. Any idea what could trigger this behavior?

Best wishes,
Julien
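For the "Not responding" reason specifically, here is roughly where I would look next (a sketch only; the log paths below are common defaults, so check SlurmdLogFile / SlurmctldLogFile in scontrol show config for your actual locations):

# On the controller, see the full state slurmctld records for one of the flapping nodes:
scontrol show node FX12

# On FX12 itself, confirm slurmd is running and read its recent messages:
systemctl status slurmd
journalctl -u slurmd --since "1 hour ago"
tail -n 100 /var/log/slurm/slurmd.log

# On the controller, slurmctld's log usually records why it marked the node not responding:
tail -n 100 /var/log/slurm/slurmctld.log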
-- slurm-users mailing list -- [email protected] To unsubscribe send an email to [email protected]
