Thanks for the update. We are going to try to build a new package and test it.
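In case it is useful to others, this is roughly how we plan to build the test package; a sketch only, assuming a plain autotools build of the slurm-18.08 branch that Doug linked below (our actual install prefix and packaging will differ):

    # Check out the slurm-18.08 branch that carries the reboot RPC fixes
    # (branch name taken from the commits link in Doug's reply below).
    git clone --branch slurm-18.08 https://github.com/SchedMD/slurm.git
    cd slurm

    # Plain autotools build as in the Slurm quickstart guide; the prefix is
    # only an example, our real deployment goes through our own packaging.
    ./configure --prefix=/opt/slurm/18.08-test
    make -j "$(nproc)"
    sudo make install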
On 22/01/2019 15:30, Douglas Jacobsen wrote:
There were several related commits last week:
https://github.com/SchedMD/slurm/commits/slurm-18.08

On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen <dmjacob...@lbl.gov> wrote:

Hello,

Yes, it's a bug in the way the reboot RPCs are handled. A fix was recently committed which we have yet to test, but 18.08.5 is meant to repair this (among other things).

Doug

On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten <martijn.krui...@surfsara.nl> wrote:

Hi,

We are encountering a strange issue on our system (Slurm 18.08.3), and I'm curious whether any of you recognize this behavior. In the following example we try to reboot 32 nodes, of which 31 are idle:

    root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
    root# sinfo -o "%100E %9u %19H %N"
    REASON USER TIMESTAMP NODELIST
    image root 2019-01-21T17:03:49 r8n32
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:47 r8n[1-3]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:47 r8n[4-10]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:48 r8n[11-15]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:48 r8n[16-23]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:49 r8n[24-29]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:49 r8n[30-31]

For as long as the allocated node (r8n32) has not been rebooted, the "reboot issued" message keeps being appended to the reason on all the other nodes, and the ResumeTimeout is ignored. Even worse: the other nodes get stuck in an endless reboot loop. It seems like they keep getting the instruction to reboot. As soon as I cancel the reboot for the allocated node, the reboot loop stops for all other nodes.

This also happens if we issue the reboot command in a loop:

    root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done

So it seems that Slurm somehow groups together all nodes that need to be rebooted, and keeps issuing reboot commands to them until the last one is ready to reboot. This happens regardless of whether the scontrol command is issued for all nodes at once or for each node separately. I should add that the command works fine if we need to reboot just one node, or a couple of nodes that were already idle to begin with. The RebootProgram is /sbin/reboot, so nothing out of the ordinary.

Best regards,

Martijn Kruiten

--
| System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 6 20043417 | martijn.krui...@surfsara.nl | www.surfsara.nl |
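For anyone else hitting this before 18.08.5: the workaround Martijn describes above (cancelling the pending reboot on the still-allocated node) can be done with scontrol. A rough sketch, using the node from the example above and assuming the cancel_reboot subcommand is available in your 18.08 scontrol:

    # Cancel the pending ASAP reboot on the allocated node so the other
    # nodes drop out of the reboot loop.
    root# scontrol cancel_reboot r8n32

    # Re-issue the reboot for that node on its own once its jobs have finished.
    root# scontrol reboot ASAP nextstate=resume reason=image r8n32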
--
Bas van der Vlies | Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam
| T +31 (0) 20 800 1300 | bas.vandervl...@surfsara.nl | www.surfsara.nl |