On 9/23/25 16:44, Davide DelVento wrote:
> As the great Ole just taught us in another thread, this should tell you why:
>
> sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User where nodes=FX[12-14]
>
> However, I suspect you'd only get "not responding" again ;-)

Good prediction!

sacctmgr show event Format=NodeName,TimeStart,Duration,State%-6,Reason%-40,User
       NodeName           TimeStart      Duration State                                     Reason       User
--------------- ------------------- ------------- ------ ---------------------------------------- ----------
                2021-08-25T11:13:56 1490-12:21:12        Cluster Registered TRES
FX12            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
FX13            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
FX14            2025-09-08T15:04:39   15-08:30:29 DOWN*  Not responding                           slurm(640+
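Side note: the trailing "slurm(640+" in the User column is just the slurm user plus its uid, cut off by the field width; if I read the sacctmgr man page right, asking for a wider field, e.g. User%-20 in the Format list, would show it in full.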

> Are you sure that all the slurm services are running correctly on those servers? Maybe try rebooting them?

The services were all running. "Correctly" is harder to say :-) I did not see anything obviously interesting in the logs, but I am not sure what to look for.
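For the record, here is roughly what I looked at on each of the three nodes (a sketch; the exact log location depends on what slurm.conf sets for SlurmdLogFile):

systemctl status slurmd munge
journalctl -u slurmd --since "2025-09-08"
grep -i error /var/log/slurm/slurmd.log

Nothing jumped out in any of them.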

Anyway, I've followed your advice and rebooted the servers, and they are idle for now. I will see how long that lasts. If that turns out to be the fix, I will fall on my sword and apologize for disturbing the ML...
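If they go DOWN* again, my plan (assuming I have the scontrol syntax right) is to first try resuming them by hand and watch whether they immediately fall back to "Not responding":

scontrol update NodeName=FX[12-14] State=RESUME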

Best,

Julien
