That 'not responding' is the issue and usually means one of two things:
1) slurmd is not running on the node
2) something on the network is blocking communication between the node and the master (firewall, selinux, congestion, bad nic, routes, etc.)
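A couple of quick checks can narrow down which of the two it is (a rough sketch; adjust the node name, and note your slurm.conf below sets SlurmdPort=6818):

  # on the node itself: is slurmd actually running?
  systemctl status slurmd

  # from the controller: does the name resolve, and is the slurmd port reachable?
  # (nc is just one option, if it is installed; any TCP check will do)
  getent hosts slurm4-compute9
  nc -zv slurm4-compute9 6818

  # and check what reason slurmctld recorded
  sinfo -R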
Brian Andrus
On 7/30/2021 3:51 PM, Soichi Hayashi wrote:
Brian,
Thank you for your reply, and thanks for setting the email title. I forgot to edit it before I sent it!
I am not sure how to reply to your reply.. but I hope this makes it to the right place..
I've updated slurm.conf to increase the controller debug level
> SlurmctldDebug=5
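(As an aside, the same change can usually be made at runtime, without restarting slurmctld; this assumes a fairly standard scontrol:

  # raise slurmctld logging to debug on the fly
  scontrol setdebug debug
  # or re-read slurm.conf after editing it
  scontrol reconfigure
)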
I now see additional log output (debug).
[2021-07-30T22:42:05.255] debug: Spawning ping agent for slurm4-compute[2-6,10,12-14]
[2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not responding, setting DOWN
It's still very sparse, but it looks like slurm is trying to ping nodes that have already been removed (they don't exist anymore, as they are removed by the slurm_suspend.sh script).
I tried sinfo -R but it doesn't really give much info..
$ sinfo -R
REASON          USER   TIMESTAMP            NODELIST
Not responding  slurm  2021-07-30T22:42:05  slurm4-compute[9,15,19-22,30]
These machines are gone, so they should not respond.
$ ping slurm4-compute9
ping: slurm4-compute9: Name or service not known
This is expected.
Why does slurm keep trying to contact nodes that have already been removed?
slurm_suspend.sh does the following to "remove" a node from the partition:
> scontrol update nodename=${host} nodeaddr="(null)"
Maybe this isn't the correct way to do it? Is there a way to force slurm to forget about the node? I tried "scontrol update node=$node state=idle", but this only works for a few minutes until slurm's ping agent kicks in and marks them down again.
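For reference, the suspend side of this kind of setup typically looks something like the sketch below (hypothetical; the actual slurm_suspend.sh is not shown in this thread). Slurm passes the nodes to suspend as a single hostlist expression in $1:

  #!/bin/bash
  # hypothetical sketch: $1 is the hostlist slurmctld passes in, e.g. "slurm4-compute[9,15]"
  for host in $(scontrol show hostnames "$1"); do
      # ...destroy the cloud instance for ${host} here (provider-specific, not shown)...
      scontrol update nodename="${host}" nodeaddr="(null)"   # the step quoted above
  done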
Thanks!!
Soichi
On Fri, Jul 30, 2021 at 2:21 PM Soichi Hayashi <hayas...@iu.edu> wrote:
Hello. I need help with troubleshooting our slurm cluster.
I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud infrastructure (Jetstream) using the elastic computing mechanism (https://slurm.schedmd.com/elastic_computing.html). Our cluster works for the most part, but for some reason a few of our nodes constantly go into "down" state.
PARTITION AVAIL  TIMELIMIT   JOB_SIZE    ROOT  OVERSUBS  GROUPS  NODES  STATE  NODELIST
cloud*    up     2-00:00:00  1-infinite  no    YES:4     all     10     idle~  slurm9-compute[1-5,10,12-15]
cloud*    up     2-00:00:00  1-infinite  no    YES:4     all     5      down   slurm9-compute[6-9,11]
The only log entries I see in the slurm log are these..
[2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
[2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
[2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
[2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
[2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
..
[2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN
With elastic computing, any unused nodes are automatically removed (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are *expected* to not respond once they are removed, but they should not be marked as DOWN. They should simply be set to "idle".
To work around this issue, I am running the following cron job.
0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume
This "works" somewhat.. but our nodes go to "DOWN" state so often
that running this every hour is not enough.
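A slightly more targeted variant of the same workaround (just a sketch, not a fix for the underlying issue) would be to resume only the nodes that sinfo currently reports as down:

  # resume only the nodes currently reported down in the cloud partition
  for n in $(sinfo -h -N -p cloud -t down -o '%N'); do
      scontrol update nodename="$n" state=resume
  done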
Here is the full content of our slurm.conf
root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
ClusterName=slurm9
ControlMachine=slurm9
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=1
Prolog=/usr/local/sbin/slurm_prolog.sh
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
#make slurm a little more tolerant here
MessageTimeout=30
TCPTimeout=15
BatchStartTimeout=20
GetEnvTimeout=20
InactiveLimit=0
MinJobAge=604800
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#FastSchedule=0
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
#CLOUD CONFIGURATION
PrivateData=cloud
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeRate=1 #number of nodes per minute that can be created; 0 means no limit
ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use
SuspendRate=1 #number of nodes per minute that can be suspended/destroyed
SuspendTime=600 #time in seconds before an idle node is suspended
SuspendTimeout=300 #time between running SuspendProgram and the node being completely down
TreeWidth=30
NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES
I appreciate your assistance!
Soichi Hayashi
Indiana University