Hey, I am running a Slurm cluster that I inherited from an employee who left, so you will have to forgive any ignorance on my part; I am still coming up to speed on some core concepts.
I have a vexing issue where one Slurm node consistently becomes unresponsive. Network and DNS seem to be working fine, but the control node logs "Nodes node3 not responding, setting DOWN". If I mark the node as RESUME it comes back up, but no jobs can be scheduled on it until I restart the slurmd process.

I enabled debug logging on the troublesome node, and I see it logging errors like the following near constantly:

[2020-09-08T09:02:35.189] [59921.0] error: Unable to establish controller machine
[2020-09-08T09:02:40.584] [59924.0] error: Unable to establish controller machine
[2020-09-08T09:03:02.550] [59923.extern] error: Unable to establish controller machine
[2020-09-08T09:03:04.537] [59921.extern] error: Unable to establish controller machine
[2020-09-08T09:03:09.474] [59924.extern] error: Unable to establish controller machine

This of course seems problematic, though it should be noted that these errors do not correlate chronologically with the outages at all -- as I said, they log near constantly.

One final piece of context: this machine OOM'd last week, and the issue began after we brought it back up. As part of that process I had to re-join it to IPA, so I am not sure if there is something there that could have caused this issue.

Any help or advice would be much appreciated, thanks!

-Grant
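For reference, here is roughly the recovery procedure I run each time the node goes DOWN (a sketch assuming a systemd-managed slurmd; the node name node3 is from my setup):

```shell
# On the control node: clear the DOWN state so the node is eligible again
scontrol update NodeName=node3 State=RESUME

# On node3 itself: jobs still won't schedule until slurmd is restarted
systemctl restart slurmd

# Sanity checks afterwards: confirm the controller is reachable and
# see what state the scheduler now reports for the node
scontrol ping
sinfo -n node3
```

This gets the node working again every time, but obviously I would like to understand the root cause rather than keep restarting slurmd.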