On 9/23/25 16:44, Davide DelVento wrote:
> As the great Ole just taught us in another thread, this should tell you why:
>
> sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]
>
> However, I suspect you'd only get "not responding" again ;-)

Good prediction!

sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User
       NodeName           TimeStart      Duration State                                     Reason       User
--------------- ------------------- ------------- ------ ---------------------------------------- ----------
                2021-08-25T11:13:56 1490-12:21:12        Cluster Registered TRES
FX12            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
FX13            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
FX14            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
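Side note: the trailing "slurm(640+" in the User column is just the slurm user plus its uid, cut off by the field width; if I read the sacctmgr man page right, asking for a wider field, e.g. User%-20 in the Format list, would show it in full.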

> Are you sure that all the slurm services are running correctly on those servers? Maybe try rebooting them?

The services were all running. "Correctly" is harder to say :-) I did not see anything obviously interesting in the logs, but I am not sure what to look for.
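For the record, here is roughly what I looked at on each of the three nodes (a sketch; the exact log location depends on what slurm.conf sets for SlurmdLogFile):

systemctl status slurmd munge
journalctl -u slurmd --since "2025-09-08"
grep -i error /var/log/slurm/slurmd.log

Nothing jumped out in any of them.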

Anyway, I've followed your advice and rebooted the servers, and they are idle for now. I will see how long that lasts. If that turns out to be the fix, I will fall on my sword and apologize for disturbing the ML...
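If they go DOWN* again, my plan (assuming I have the scontrol syntax right) is to first try resuming them by hand and watch whether they immediately fall back to "Not responding":

scontrol update NodeName=FX[12-14] State=RESUME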

Best,

Julien
