Look at the slurmd logs on these nodes, or try running slurmd in the foreground (non-daemon mode) so you can watch its output directly.
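For example (only a sketch; the log path is an assumption, use whatever your config reports):

    # stop the service, then run slurmd in the foreground with extra verbosity
    systemctl stop slurmd
    slurmd -D -vvv

    # or locate and follow the slurmd log
    scontrol show config | grep -i SlurmdLogFile
    tail -f /var/log/slurm/slurmd.log    # assumed path; use the one reported above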
And, as I said in another thread, check that the clocks on these nodes are in sync with the controller (a quick sketch of what I mean is below the quoted message).

On Tue, Sep 23, 2025, 11:41 PM Julien Tailleur via slurm-users <[email protected]> wrote:

> On 9/23/25 16:44, Davide DelVento wrote:
> > As the great Ole just taught us in another thread, this should tell
> > you why:
> >
> > sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]
> >
> > However I suspect you'd only get "not responding" again ;-)
>
> Good prediction!
>
> sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User
>        NodeName           TimeStart      Duration State  Reason                                   User
> --------------- ------------------- ------------- ------ ---------------------------------------- ----------
>                 2021-08-25T11:13:56 1490-12:21:12        Cluster Registered TRES
> FX12            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
> FX13            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
> FX14            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
>
> > Are you sure that all the slurm services are running correctly on
> > those servers? Maybe try rebooting them?
>
> The services were all running. "Correctly" is harder to say :-) I did not
> see anything obviously interesting in the logs, but I am not sure what
> to look for.
>
> Anyway, I've followed your advice and rebooted the servers and they are
> idle for now. I will see how long it lasts. If that fixed it, I will
> fall on my sword and apologize for disturbing the ML...
>
> Best,
>
> Julien
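Here is roughly what I mean by checking the time. This is only a sketch: it assumes chrony as the NTP client and clush as the parallel shell, and "controller" stands in for your slurmctld host, so adjust for your setup:

    # compare wall-clock time on the nodes and the controller in one shot
    clush -w FX[12-14],controller date +%s.%N

    # on each node, check sync status and offset (swap in ntpq/timesyncd commands if you don't run chrony)
    timedatectl status
    chronyc tracking

Significant clock skew can break munge authentication between slurmctld and slurmd, which can show up as exactly this kind of persistent "Not responding" state.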
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
