There were several related commits last week: https://github.com/SchedMD/slurm/commits/slurm-18.08
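Once a build containing those commits (or the 18.08.5 release mentioned below) has been rolled out, the running version can be double-checked on the controller and the compute nodes with the standard client commands, for example:

root# scontrol --version
root# sinfo --version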
On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen <dmjacob...@lbl.gov> wrote:

> Hello,
>
> Yes, it's a bug in the way the reboot RPCs are handled. A fix was recently
> committed which we have yet to test, but 18.08.5 is meant to repair this
> (among other things).
>
> Doug
>
> On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten <martijn.krui...@surfsara.nl> wrote:
>
>> Hi,
>>
>> We encounter a strange issue on our system (Slurm 18.08.3), and I'm
>> curious whether any of you recognize this behavior. In the following
>> example we try to reboot 32 nodes, of which 31 are idle:
>>
>> root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
>> root# sinfo -o "%100E %9u %19H %N"
>>
>> REASON  USER  TIMESTAMP            NODELIST
>> image   root  2019-01-21T17:03:49  r8n32
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue  root  2019-01-21T17:03:47  r8n[1-3]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue  root  2019-01-21T17:03:47  r8n[4-10]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue  root  2019-01-21T17:03:48  r8n[11-15]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue  root  2019-01-21T17:03:48  r8n[16-23]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue  root  2019-01-21T17:03:49  r8n[24-29]
>> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue  root  2019-01-21T17:03:49  r8n[30-31]
>>
>> For as long as the allocated node (r8n32) has not been rebooted, the
>> "reboot issued" message keeps getting appended to the reason of all other
>> nodes, and the ResumeTimeout is ignored. Even worse: the other nodes get
>> stuck in an endless reboot loop. It seems like they keep getting the
>> instruction to reboot. As soon as I cancel the reboot for the allocated
>> node, the reboot loop stops for all other nodes.
>>
>> This also happens if we issue the reboot command in a loop:
>>
>> root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done
>>
>> So it seems that Slurm somehow groups together all nodes that need to be
>> rebooted, and keeps issuing reboot commands to them until the last one of
>> them is ready to reboot. This happens regardless of whether the scontrol
>> command has been issued for all nodes at once or independently.
>>
>> I should add that the command works fine if we only need to reboot a
>> single node, or a couple of nodes that were already idle to begin with.
>> The RebootProgram is /sbin/reboot, so nothing out of the ordinary.
>>
>> Best regards,
>>
>> Martijn Kruiten
>>
>> --
>> | System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
>> | T +31 6 20043417 | martijn.krui...@surfsara.nl | www.surfsara.nl |
>
> --
> Sent from Gmail Mobile

--
Sent from Gmail Mobile
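For anyone hitting the same loop before upgrading: the workaround described above (cancelling the pending reboot on the node that is still allocated) can be done with scontrol's cancel_reboot subcommand, available in recent Slurm releases including the 18.08 series. This is only a sketch based on the behaviour reported in this thread, using the same node names as the example above:

root# scontrol cancel_reboot r8n32   # clear the pending ASAP reboot on the still-allocated node; per the report, the other nodes then stop looping
root# scontrol reboot ASAP nextstate=resume reason=image r8n32   # re-issue the reboot for that node once its job has finished

On 18.08.5 and later, with the fix referenced above, the single scontrol reboot over the whole range should behave as expected again.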