Does it tell you the reason it is down? Check with:
sinfo -R
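For reference, the commands below show where Slurm records the reason (the node name is a placeholder; run these on the controller):

```
# List the recorded reason, who set it, and when, for every down/drained node:
sinfo -R

# Full detail for one node, including the Reason= field and the memory
# slurmd actually reported (RealMemory vs. what slurmd sees):
scontrol show node <nodename>
```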
I have seen cases where a node comes up but the amount of memory slurmd
sees is slightly less than what was configured in slurm.conf. You should
always set aside some memory when defining a node in slurm.conf, so there
is room for the operating system and things don't choke if the node comes
up a bit short because some driver took more memory when it loaded.
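For example, a hypothetical node definition with headroom built in (the node name, CPU count, and GRES are made up; RealMemory is in MB, set a few GB below the installed 192 GB):

```
# slurm.conf (hypothetical): the machine has 192 GB (~196608 MB) installed,
# but RealMemory is set lower so the node is not marked down if the OS or
# a driver claims some of it at boot.
NodeName=gpu-t4-4x-ondemand-[1-4] CPUs=16 RealMemory=185000 Gres=gpu:t4:4 State=CLOUD
```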
Brian Andrus
On 5/19/2021 9:15 PM, Herc Silverstein wrote:
Hi,
We have a cluster (in Google GCP) with a few partitions set up to
auto-scale, but one partition set up not to autoscale. The desired
state is for all of the nodes in this non-autoscaled partition
(SuspendExcParts=gpu-t4-4x-ondemand) to continue running
uninterrupted. However, we are finding that nodes periodically end up
in the down* state and that we cannot get them back into a usable
state. We are running Slurm 19.05.7.
We have a script that runs periodically, checks the state of each
node, and takes action based on it. If a node is in a down state, it
gets terminated, and if the termination succeeds its state is set to
POWER_DOWN. After a short one-second pause, those nodes that are in
the POWERING_DOWN (and not drained) state are set to RESUME.
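The remediation loop described above can be sketched as a dry run. The `sinfo` output below is a hypothetical sample of `sinfo -h -N -o "%N %t"` (node names are made up); the sketch only prints the `scontrol` commands it would issue rather than running them:

```shell
# Hypothetical sample of `sinfo -h -N -o "%N %t"` output: node name, state.
sample_sinfo='gpu-node-01 down*
gpu-node-02 idle
gpu-node-03 down'

# Select nodes in the down or down* state and print (not run) the
# scontrol command the script would issue for each one.
echo "$sample_sinfo" | awk '$2 ~ /^down\*?$/ {
    printf "scontrol update nodename=%s state=power_down\n", $1
}'
```

In a real script, the printed commands would be executed after the node is terminated, followed by `scontrol update nodename=<n> state=resume` for nodes seen in POWERING_DOWN.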
Sometimes, after we start a node up and slurmd is running on it, we
cannot get it back into a usable Slurm state even after manually
fiddling with its state. It bounces between idle* and down*. But the
node is there and we can log into it.
Does anyone have an idea of what might be going on? And what we can
do to get these nodes back into a usable (I guess "idle") state?
Thanks,
Herc