As the great Ole just taught us in another thread, this should tell you why:
sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]

However I suspect you'd only get "not responding" again ;-)

Are you sure that all the slurm services are running correctly on those servers? Maybe try rebooting them?

On Tue, Sep 23, 2025 at 12:15 PM Julien Tailleur via slurm-users <[email protected]> wrote:

> Dear all,
>
> I am maintaining a small computing cluster and I have a weird behavior
> that I fail at debugging.
>
> My cluster comprise one master node and 16 computing servers, organized
> in two queues, each queue having 8 servers. All servers run up-to-date
> Debian bullseye. All but 3 servers work flawlessly.
>
> From the master node, I can see that 3 servers on one of the queue
> appear down:
>
> jtailleu@kandinsky:~$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> Volume*   up    infinite  8     alloc FX[21-24,31-34]
> Speed     up    infinite  3     down* FX[12-14]
> Speed     up    infinite  4     alloc FX[41-44]
> Speed     up    infinite  1     idle  FX11
>
> These servers are reachable by SSH/ping
>
> jtailleu@kandinsky:~$ ping -c 1 FX12
> PING FX12 (192.168.6.22) 56(84) bytes of data.
> 64 bytes from FX12 (192.168.6.22): icmp_seq=1 ttl=64 time=0.070 ms
>
> --- FX12 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.070/0.070/0.070/0.000 ms
>
> #####
>
> I can also put these nodes back into idle mode:
>
> root@kandinsky:~# scontrol update nodename=FX[12-14] state=idle
> root@kandinsky:~# sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> Volume*   up    infinite  8     alloc FX[21-24,31-34]
> Speed     up    infinite  3     idle* FX[12-14]
> Speed     up    infinite  4     alloc FX[41-44]
> Speed     up    infinite  1     idle  FX11
>
> But then, they switch back into down mode few minutes later:
>
> root@kandinsky:~# sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> Volume*   up    infinite  8     alloc FX[21-24,31-34]
> Speed     up    infinite  3     down* FX[12-14]
> Speed     up    infinite  4     alloc FX[41-44]
> Speed     up    infinite  1     idle  FX11
>
> root@kandinsky:~# sinfo -R
> REASON          USER   TIMESTAMP            NODELIST
> Not responding  slurm  2025-09-08T15:04:39  FX[12-14]
>
> I do not understand where the "not responding" comes from, nor how I can
> investigate that. Any idea what could trigger this behavior?
>
> Best wishes,
>
> Julien
>
> --
> slurm-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
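
PS: to make the "are the slurm services running?" check a bit more concrete, here is a rough sketch, not a definitive recipe. It relies on the fact that sinfo appends "*" to the state of nodes that are not responding to slurmctld (as in "down*" and "idle*" in the quoted output). The function name not_responding_nodes is made up for illustration, and it assumes the default sinfo column layout shown in the quoted message.

```shell
#!/bin/sh
# Hypothetical helper: read default `sinfo` output on stdin and print the
# NODELIST of every row whose STATE ends in "*" (node not responding).
# Assumed columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
not_responding_nodes() {
    awk '$5 ~ /\*$/ { print $6 }'
}
```

You would run it as "sinfo | not_responding_nodes" on the master node. For each node it reports, I would then log in and look at "systemctl status slurmd" there, and compare with "scontrol show node FX12" on the master, assuming slurmd is managed by systemd as it is with the stock Debian packages.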
