[slurm-users] Apparent scontrol reboot bug

Martijn Kruiten Tue, 22 Jan 2019 02:22:35 -0800

Hi,

We are encountering a strange issue on our system (Slurm 18.08.3), and I'm 
curious whether any of you recognize this behavior. In the following example we 
try to reboot 32 nodes, of which 31 are idle and one (r8n32) is allocated:


root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
root# sinfo -o "%100E %9u %19H %N"
REASON                                                                                               USER      TIMESTAMP           NODELIST
image                                                                                                root      2019-01-21T17:03:49 r8n32
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[1-3]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[4-10]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[11-15]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[16-23]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[24-29]
image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[30-31]

For as long as the allocated node (r8n32) has not been rebooted, the "reboot 
issued" message keeps appending to the reason for all other nodes, and the 
ResumeTimeout is ignored. Even worse: the other nodes get stuck in an endless 
reboot loop. It seems like they keep getting the instruction to reboot. As soon 
as I cancel the reboot for the allocated node, the reboot loop stops for all 
other nodes. 
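
In case it helps anyone check for the same symptom, here is a minimal sketch of 
how the repeated "reboot issued" suffix could be counted in a node's reason 
string. The helper name count_reboot_issued is made up for illustration. The 
commented usage lines are assumptions: that sinfo -h -n <node> -o "%E" prints 
the full reason on your system, and that scontrol cancel_reboot is available in 
your Slurm version.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print how many times the text
# "reboot issued" occurs in the reason string passed as the first argument.
count_reboot_issued() {
    # grep -o prints each match on its own line; "|| true" keeps the
    # pipeline from failing when there is no match at all.
    printf '%s\n' "$1" | { grep -o 'reboot issued' || true; } | wc -l | tr -d ' '
}

# Assumed usage on a live cluster (left commented out, adjust node names):
#   reason=$(sinfo -h -n r8n1 -o "%E")
#   count_reboot_issued "$reason"
#
# Cancelling the pending reboot of the allocated node, which is what
# stopped the loop in our case (assumes cancel_reboot exists in your version):
#   scontrol cancel_reboot r8n32
```

A count that keeps growing while another node in the same batch is still 
allocated would match what we see here. Note that a reason truncated by the 
field width can end in a partial "reboot issue", which this does not count.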

This also happens if we run the reboot command in a loop, once per node:

root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done

So it seems that Slurm somehow groups all nodes that need to be rebooted 
together, and issues reboot commands to them until the last one of them is 
ready to reboot. This happens regardless of whether the scontrol command has 
been issued for all nodes at once or independently.

I should add that the command works fine if we need to reboot just one node, or 
a couple of nodes that were already idle to begin with. The RebootProgram is 
/sbin/reboot, so nothing out of the ordinary.

Best regards,
Martijn Kruiten
-- 
| System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 6 20043417  | martijn.krui...@surfsara.nl | www.surfsara.nl |
