The SLURM controller AND all the compute nodes need to know which nodes are in the cluster. If you add a node, or a node changes IP address, you need to let all of the nodes know about the change, which, for me, usually means restarting slurmd on the compute nodes.
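Roughly, once the updated slurm.conf has been copied to every node, it comes down to something like the following (pdsh, the systemd unit names, and the host range here are just placeholders for whatever fits your setup, not a known-good recipe):

  # on the controller, after distributing the updated slurm.conf
  systemctl restart slurmctld
  # restart slurmd on every compute node so they all see the new node list
  pdsh -w 'node[001-064]' systemctl restart slurmd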
I just say this because I get caught by this all the time if I add some nodes and for whatever reason miss restarting one of the slurmd processes on the compute nodes.

Tim

On Wed, May 19, 2021 at 9:17 PM Herc Silverstein <herc.silverst...@schrodinger.com> wrote:
> Hi,
>
> We have a cluster (in Google gcp) which has a few partitions set up to
> auto-scale, but one partition is set up to not autoscale. The desired
> state is for all of the nodes in this non-autoscaled partition
> (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.
> However, we are finding that nodes periodically end up in the down*
> state and that we cannot get them back into a usable state. This is
> using slurm 19.05.7.
>
> We have a script that runs periodically and checks the state of the
> nodes and takes action based on the state. If a node is in a down
> state, then it gets terminated, and if successfully terminated its state
> is set to power_down. There is a short 1 second pause, and then those
> nodes that are in the POWERING_DOWN and not drained state are
> set to RESUME.
>
> Sometimes after we start up the node and it's running slurmd we cannot
> get some of these nodes back into a usable slurm state even after
> manually fiddling with its state. It seems to go between idle* and
> down*. But the node is there and we can log into it.
>
> Does anyone have an idea of what might be going on? And what we can do
> to get these nodes back into a usable (I guess "idle") state?
>
> Thanks,
>
> Herc
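For anyone stuck in the same idle*/down* loop, the manual check-and-resume step being described is roughly this (the node name is made up for illustration; substitute one of your own), and it only tends to stick once slurmd on that node has been restarted against the current slurm.conf:

  # see why the controller thinks the node is down
  scontrol show node gpu-t4-4x-ondemand-1 | grep -i reason
  # after confirming slurmd is running on the node with the current config
  scontrol update NodeName=gpu-t4-4x-ondemand-1 State=RESUME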