Does it tell you the reason it is down? Check with:
sinfo -R
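For reference, the commands below show where Slurm records the reason (the node name is a placeholder; run these on the controller):

```
# List the recorded reason, who set it, and when, for every down/drained node:
sinfo -R

# Full detail for one node, including the Reason= field and the memory
# slurmd actually reported (RealMemory vs. what slurmd sees):
scontrol show node <nodename>
```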
I have seen cases where a node comes up but the amount of memory slurmd
sees is slightly less than what was configured in slurm.conf. You should
always set aside some memory when defining a node in slurm.conf, so there
is room for the operating system and things don't choke if the node comes
up a bit short because some driver took more memory when it loaded.
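For example, a hypothetical node definition with headroom built in (the node name, CPU count, and GRES are made up; RealMemory is in MB, set a few GB below the installed 192 GB):

```
# slurm.conf (hypothetical): the machine has 192 GB (~196608 MB) installed,
# but RealMemory is set lower so the node is not marked down if the OS or
# a driver claims some of it at boot.
NodeName=gpu-t4-4x-ondemand-[1-4] CPUs=16 RealMemory=185000 Gres=gpu:t4:4 State=CLOUD
```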
Brian Andrus
On 5/19/2021 9:15 PM, Herc Silverstein wrote:
Hi,
We have a cluster (in Google GCP) with a few partitions set up to
auto-scale, but one partition set up not to autoscale. The desired
state is for all of the nodes in this non-autoscaled partition
(SuspendExcParts=gpu-t4-4x-ondemand) to continue running
uninterrupted. However, we are finding that nodes periodically end up
in the down* state and that we cannot get them back into a usable
state. We are running Slurm 19.05.7.
We have a script that runs periodically, checks the state of each
node, and takes action based on it. If a node is in a down state, it
gets terminated, and if the termination succeeds its state is set to
POWER_DOWN. After a short one-second pause, those nodes that are in
the POWERING_DOWN (and not drained) state are set to RESUME.
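The remediation loop described above can be sketched as a dry run. The `sinfo` output below is a hypothetical sample of `sinfo -h -N -o "%N %t"` (node names are made up); the sketch only prints the `scontrol` commands it would issue rather than running them:

```shell
# Hypothetical sample of `sinfo -h -N -o "%N %t"` output: node name, state.
sample_sinfo='gpu-node-01 down*
gpu-node-02 idle
gpu-node-03 down'

# Select nodes in the down or down* state and print (not run) the
# scontrol command the script would issue for each one.
echo "$sample_sinfo" | awk '$2 ~ /^down\*?$/ {
    printf "scontrol update nodename=%s state=power_down\n", $1
}'
```

In a real script, the printed commands would be executed after the node is terminated, followed by `scontrol update nodename=<n> state=resume` for nodes seen in POWERING_DOWN.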
Sometimes, after we start a node up and slurmd is running on it, we
cannot get it back into a usable Slurm state even after manually
fiddling with its state. It bounces between idle* and down*. But the
node is there and we can log into it.
Does anyone have an idea of what might be going on? And what we can
do to get these nodes back into a usable (I guess "idle") state?
Thanks,
Herc