Thanks for the update. We are going to try to build a new package and test it.
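In case it is useful to others, this is roughly how we plan to build the test package; a sketch only, assuming a plain autotools build of the slurm-18.08 branch that Doug linked below (our actual install prefix and packaging will differ):

    # Check out the slurm-18.08 branch that carries the reboot RPC fixes
    # (branch name taken from the commits link in Doug's reply below).
    git clone --branch slurm-18.08 https://github.com/SchedMD/slurm.git
    cd slurm

    # Plain autotools build as in the Slurm quickstart guide; the prefix is
    # only an example, our real deployment goes through our own packaging.
    ./configure --prefix=/opt/slurm/18.08-test
    make -j "$(nproc)"
    sudo make install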
On 22/01/2019 15:30, Douglas Jacobsen wrote:
There were several related commits last week:
https://github.com/SchedMD/slurm/commits/slurm-18.08

On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen <dmjacob...@lbl.gov> wrote:

Hello,

Yes, it's a bug in the way the reboot RPCs are handled. A fix was recently committed which we have yet to test, but 18.08.5 is meant to repair this (among other things).

Doug

On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten <martijn.krui...@surfsara.nl> wrote:

Hi,

We are encountering a strange issue on our system (Slurm 18.08.3), and I'm curious whether any of you recognize this behavior. In the following example we try to reboot 32 nodes, of which 31 are idle:

    root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
    root# sinfo -o "%100E %9u %19H %N"
    REASON USER TIMESTAMP NODELIST
    image root 2019-01-21T17:03:49 r8n32
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:47 r8n[1-3]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:47 r8n[4-10]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:48 r8n[11-15]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:48 r8n[16-23]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:49 r8n[24-29]
    image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root 2019-01-21T17:03:49 r8n[30-31]

For as long as the allocated node (r8n32) has not been rebooted, the "reboot issued" message keeps being appended to the reason on all the other nodes, and the ResumeTimeout is ignored. Even worse: the other nodes get stuck in an endless reboot loop. It seems like they keep getting the instruction to reboot. As soon as I cancel the reboot for the allocated node, the reboot loop stops for all other nodes.

This also happens if we issue the reboot command in a loop:

    root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done

So it seems that Slurm somehow groups together all nodes that need to be rebooted, and keeps issuing reboot commands to them until the last one is ready to reboot. This happens regardless of whether the scontrol command is issued for all nodes at once or for each node separately. I should add that the command works fine if we need to reboot just one node, or a couple of nodes that were already idle to begin with. The RebootProgram is /sbin/reboot, so nothing out of the ordinary.

Best regards,

Martijn Kruiten

--
| System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 6 20043417 | martijn.krui...@surfsara.nl | www.surfsara.nl |
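For anyone else hitting this before 18.08.5: the workaround Martijn describes above (cancelling the pending reboot on the still-allocated node) can be done with scontrol. A rough sketch, using the node from the example above and assuming the cancel_reboot subcommand is available in your 18.08 scontrol:

    # Cancel the pending ASAP reboot on the allocated node so the other
    # nodes drop out of the reboot loop.
    root# scontrol cancel_reboot r8n32

    # Re-issue the reboot for that node on its own once its jobs have finished.
    root# scontrol reboot ASAP nextstate=resume reason=image r8n32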
--
Bas van der Vlies | Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam
| T +31 (0) 20 800 1300 | bas.vandervl...@surfsara.nl | www.surfsara.nl |