I think there are at least two possible ways to do what you want. You can make a reservation on the node and mark it as a maintenance reservation. I don't know whether Slurm will shut down a node that sits idle under a maintenance reservation, but it certainly won't if you also run a job as root to occupy the reservation. I've tried doing the upgrades in that job itself and having it reboot the node when the upgrade finishes, and that works quite well if the upgrade behaves.
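A rough sketch of the reservation approach, in case it helps. The node name, reservation name, four-hour duration and `upgrade.sh` are placeholders of mine, not anything from your setup, and the `run` wrapper only prints the commands unless you set DRY_RUN=0:

```shell
#!/usr/bin/env bash
# Sketch: hold a node in a maintenance reservation while it is upgraded.
# All names and the duration below are illustrative assumptions.

# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reserve_node NODE RES_NAME: create a root-only reservation flagged MAINT.
reserve_node() {
    run scontrol create reservation "ReservationName=$2" "Nodes=$1" \
        Users=root StartTime=now Duration=04:00:00 Flags=MAINT
}

# submit_upgrade NODE RES_NAME: occupy the reservation with a root job.
# upgrade.sh is a hypothetical script that upgrades and then reboots the node.
submit_upgrade() {
    run sbatch "--reservation=$2" "--nodelist=$1" --exclusive upgrade.sh
}
```

You would call `reserve_node node01 maint_node01` and then `submit_upgrade node01 maint_node01` as root, and remove the reservation afterwards with `scontrol delete ReservationName=maint_node01`.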
The other way would be to kill slurmd on the node outright as soon as it has drained and then run the upgrade manually. With slurmd dead, slurmctld thinks the node is already offline and won't try to shut it down.

On Tue, Sep 1, 2020 at 2:40 AM Steininger, Herbert <herbert_steinin...@psych.mpg.de> wrote:
>
> Hi Guys,
>
> Thanks for your answers.
>
> I would like not to patch the source code of Slurm, like Jacek does it, to
> make things easier.
> But I think, it is the way to go.
>
> When I try the solutions, Florian and Angelos suggested, slurm will still
> think that the nodes are "powered down", even if they not.
> Well, it is better that slurm only thinks that they are down, better as if
> they will power down while upgrading something.
>
> What we really need is some state like "MAINT", for maintenance, which will
> slurm tell, not to utilize the node but also don’t power down the node.
>
> Thanks,
> Herbert
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On behalf
> of Florian Zillner
> Sent: Wednesday, 26 August 2020 10:36
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [External] [slurm 20.02.3] don't suspend nodes in
> down state
>
> Hi Herbert,
>
> just like Angelos described, we also have logic in our poweroff script that
> checks if the node is really IDLE and only sends the poweroff command if
> that's the case.
>
> Excerpt:
>
> hosts=$(scontrol show hostnames $1)
> for host in $hosts; do
>     scontrol show node $host | tr ' ' '\n' | grep -q 'State=IDLE+POWER$'
>     if [[ $? == 1 ]]; then
>         echo "node $host NOT IDLE" >>$OUTFILE
>         continue
>     else
>         echo "node $host IDLE" >>$OUTFILE
>     fi
>     ssh $host poweroff
>     ...
>     sleep 1
>     ...
> done
>
> Best,
> Florian
>
> ________________________________
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Steininger, Herbert <herbert_steinin...@psych.mpg.de>
> Sent: Monday, 24 August 2020 10:52
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: [External] [slurm-users] [slurm 20.02.3] don't suspend nodes in down
> state
>
> Hi,
>
> how can I prevent slurm, to suspend nodes, which I have set to down state for
> maintenance?
> I know about "SuspendExcNodes", but this doesn't seem the right way, to roll
> out the slurm.conf every time this changes.
> Is there a state that I can set so that the nodes doesn't get suspended?
>
> It happened a few times that I was doing some stuff on a server and after our
> idle time (1h) slurm decided to suspend the node.
>
> TIA,
> Herbert
>
> --
> Herbert Steininger
> Leiter EDV & HPC
> Administrator
> Max-Planck-Institut für Psychiatrie
> Kraepelinstr. 2-10
> 80804 München
> Tel +49 (0)89 / 30622-368
> Mail herbert_steinin...@psych.mpg.de
> Web https://www.psych.mpg.de
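
For the second approach (stop slurmd once the node has drained), a minimal sketch might look like the following. It assumes that slurmd runs as a systemd service reachable over ssh, that a drained node reports a State containing IDLE and DRAIN in `scontrol show node`, and that a 30-second poll is acceptable; the function names and the Reason text are mine:

```shell
#!/usr/bin/env bash
# Sketch: drain a node, wait until it has drained, then stop slurmd on it.

# Read `scontrol show node` output on stdin and print the State= value.
node_state() {
    tr ' ' '\n' | sed -n 's/^State=//p' | head -n 1
}

# Succeed if the given state string looks fully drained (assumed: IDLE+DRAIN).
is_drained() {
    case "$1" in
        *IDLE*DRAIN*) return 0 ;;
        *) return 1 ;;
    esac
}

# drain_and_stop NODE: assumes a systemd-managed slurmd and working ssh access.
drain_and_stop() {
    local node="$1"
    scontrol update "NodeName=${node}" State=DRAIN Reason="manual upgrade"
    until is_drained "$(scontrol show node "${node}" | node_state)"; do
        sleep 30
    done
    ssh "${node}" systemctl stop slurmd
}
```

Running `drain_and_stop node01` as root on the controller would then leave the node ready for a manual upgrade; start slurmd again and resume the node when you are done.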