Hey, I am running a Slurm cluster that I inherited from an employee who left, so you will have to forgive any ignorance on my part; I am still coming up to speed on some core concepts.
I have a vexing issue where one Slurm node consistently becomes unresponsive. Network and DNS seem to be working fine, but the control node logs "Nodes node3 not responding, setting DOWN". If I mark the node as RESUME it comes back up, but no jobs can be scheduled on it until I restart the slurmd process.

I enabled debug logging on the troublesome node, and I see it logging errors like the following near constantly:

[2020-09-08T09:02:35.189] [59921.0] error: Unable to establish controller machine
[2020-09-08T09:02:40.584] [59924.0] error: Unable to establish controller machine
[2020-09-08T09:03:02.550] [59923.extern] error: Unable to establish controller machine
[2020-09-08T09:03:04.537] [59921.extern] error: Unable to establish controller machine
[2020-09-08T09:03:09.474] [59924.extern] error: Unable to establish controller machine

This of course seems problematic, though it should be noted that these errors do not correlate chronologically with the outages at all -- as I said, they log near constantly.

One final piece of context: this machine OOM'd last week, and the issue began after we brought it back up. As part of that process I had to re-join it to IPA, so I am not sure if there is something there that could have caused this issue.

Any help or advice would be much appreciated, thanks!

-Grant
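For reference, here is roughly the recovery procedure I run each time the node goes DOWN (a sketch assuming a systemd-managed slurmd; the node name node3 is from my setup):

```shell
# On the control node: clear the DOWN state so the node is eligible again
scontrol update NodeName=node3 State=RESUME

# On node3 itself: jobs still won't schedule until slurmd is restarted
systemctl restart slurmd

# Sanity checks afterwards: confirm the controller is reachable and
# see what state the scheduler now reports for the node
scontrol ping
sinfo -n node3
```

This gets the node working again every time, but obviously I would like to understand the root cause rather than keep restarting slurmd.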