The SLURM controller AND all the compute nodes need to know which nodes are in the cluster. If you add a node, or a node changes IP address, you need to let all of the nodes know about the change, which, for me, usually means restarting slurmd on the compute nodes.
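Roughly, once the updated slurm.conf has been copied to every node, it comes down to something like the following (pdsh, the systemd unit names, and the host range here are just placeholders for whatever fits your setup, not a known-good recipe):

  # on the controller, after distributing the updated slurm.conf
  systemctl restart slurmctld
  # restart slurmd on every compute node so they all see the new node list
  pdsh -w 'node[001-064]' systemctl restart slurmd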
I just say this because I get caught by this all the time if I add some nodes and for whatever reason miss restarting one of the slurmd processes on the compute nodes.

Tim

On Wed, May 19, 2021 at 9:17 PM Herc Silverstein <herc.silverst...@schrodinger.com> wrote:
> Hi,
>
> We have a cluster (in Google gcp) which has a few partitions set up to
> auto-scale, but one partition is set up to not autoscale. The desired
> state is for all of the nodes in this non-autoscaled partition
> (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.
> However, we are finding that nodes periodically end up in the down*
> state and that we cannot get them back into a usable state. This is
> using slurm 19.05.7.
>
> We have a script that runs periodically and checks the state of the
> nodes and takes action based on the state. If a node is in a down
> state, then it gets terminated, and if successfully terminated its state
> is set to power_down. There is a short 1 second pause, and then those
> nodes that are in the POWERING_DOWN and not drained state are
> set to RESUME.
>
> Sometimes after we start up the node and it's running slurmd we cannot
> get some of these nodes back into a usable slurm state even after
> manually fiddling with its state. It seems to go between idle* and
> down*. But the node is there and we can log into it.
>
> Does anyone have an idea of what might be going on? And what we can do
> to get these nodes back into a usable (I guess "idle") state?
>
> Thanks,
>
> Herc
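For anyone stuck in the same idle*/down* loop, the manual check-and-resume step being described is roughly this (the node name is made up for illustration; substitute one of your own), and it only tends to stick once slurmd on that node has been restarted against the current slurm.conf:

  # see why the controller thinks the node is down
  scontrol show node gpu-t4-4x-ondemand-1 | grep -i reason
  # after confirming slurmd is running on the node with the current config
  scontrol update NodeName=gpu-t4-4x-ondemand-1 State=RESUME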