Brian,

Thank you for your reply, and thanks for fixing the email subject; I forgot to edit it before I sent it!
I am not sure how to reply to your reply, but I hope this makes it to the right place.

I've updated slurm.conf to increase the controller debug level:

> SlurmctldDebug=5

I now see additional (debug) log output:

[2021-07-30T22:42:05.255] debug: Spawning ping agent for slurm4-compute[2-6,10,12-14]
[2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not responding, setting DOWN

It's still very sparse, but it looks like slurm is trying to ping nodes that have already been removed (they don't exist anymore, as they are removed by the slurm_suspend.sh script).

I tried sinfo -R, but it doesn't really give much info:

$ sinfo -R
REASON           USER    TIMESTAMP            NODELIST
Not responding   slurm   2021-07-30T22:42:05  slurm4-compute[9,15,19-22,30]

These machines are gone, so they should not respond.

$ ping slurm4-compute9
ping: slurm4-compute9: Name or service not known

This is expected. Why does slurm keep trying to contact nodes that have already been removed?

slurm_suspend.sh does the following to "remove" a node from the partition:

> scontrol update nodename=${host} nodeaddr="(null)"

Maybe this isn't the correct way to do it? Is there a way to force slurm to forget about the node? (I've put a rough, untested sketch of an alternative at the very bottom of this mail, below the quoted message.) I tried "scontrol update node=$node state=idle", but this only works for a few minutes, until slurm's ping agent kicks in and marks them down again.

Thanks!!
Soichi

On Fri, Jul 30, 2021 at 2:21 PM Soichi Hayashi <hayas...@iu.edu> wrote:

> Hello. I need help with troubleshooting our slurm cluster.
>
> I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud
> infrastructure (Jetstream) using the elastic computing mechanism
> (https://slurm.schedmd.com/elastic_computing.html). Our cluster works for
> the most part, but for some reason a few of our nodes constantly go into
> the "down" state.
>
> PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
> cloud*       up 2-00:00:00 1-infinite   no    YES:4    all    10 idle~ slurm9-compute[1-5,10,12-15]
> cloud*       up 2-00:00:00 1-infinite   no    YES:4    all     5 down  slurm9-compute[6-9,11]
>
> The only log I see in the slurm log is this:
>
> [2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
> [2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
> [2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
> [2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
> [2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
> ..
> [2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN
>
> With elastic computing, any unused nodes are automatically removed
> (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are
> *expected* to not respond once they are removed, but they should not be
> marked as DOWN. They should simply be set to "idle".
>
> To work around this issue, I am running the following cron job.
>
> 0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume
>
> This "works" somewhat, but our nodes go to the "DOWN" state so often that
> running this every hour is not enough.
>
> Here is the full content of our slurm.conf
>
> root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
> ClusterName=slurm9
> ControlMachine=slurm9
>
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> ProctrackType=proctrack/pgid
> ReturnToService=1
> Prolog=/usr/local/sbin/slurm_prolog.sh
>
> #
> # TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> #make slurm a little more tolerant here
> MessageTimeout=30
> TCPTimeout=15
> BatchStartTimeout=20
> GetEnvTimeout=20
> InactiveLimit=0
> MinJobAge=604800
> KillWait=30
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> #FastSchedule=0
>
> # LOGGING
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> JobCompType=jobcomp/none
>
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
>
> AccountingStorageType=accounting_storage/filetxt
> AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
>
> #CLOUD CONFIGURATION
> PrivateData=cloud
> ResumeProgram=/usr/local/sbin/slurm_resume.sh
> SuspendProgram=/usr/local/sbin/slurm_suspend.sh
> ResumeRate=1       #number of nodes per minute that can be created; 0 means no limit
> ResumeTimeout=900  #max time in seconds between ResumeProgram running and when the node is ready for use
> SuspendRate=1      #number of nodes per minute that can be suspended/destroyed
> SuspendTime=600    #time in seconds before an idle node is suspended
> SuspendTimeout=300 #time between running SuspendProgram and the node being completely down
> TreeWidth=30
>
> NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
> PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES
>
> I appreciate your assistance!
>
> Soichi Hayashi
> Indiana University
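
PS: here is the rough sketch of an alternative slurm_suspend.sh I mentioned above. It is only an idea and untested; the "openstack server delete" line is a placeholder for whatever Jetstream call we actually use to destroy the instance, and it assumes slurm_resume.sh is the piece that overrides NodeAddr with the new instance's IP. The point is to reset NodeAddr/NodeHostname back to the node's own name instead of "(null)" when the instance goes away.

#!/bin/bash
# Rough sketch only -- untested, adjust to our setup.
# slurmctld passes the nodes to suspend as a hostlist expression in $1,
# e.g. "slurm9-compute[6-9,11]"; expand it to individual names.
hosts=$(scontrol show hostnames "$1")

for host in $hosts; do
    # destroy the cloud instance (placeholder command for the Jetstream API/CLI)
    openstack server delete "$host"

    # point NodeAddr/NodeHostname back at the node's own name instead of "(null)",
    # undoing whatever slurm_resume.sh set them to when the instance was created
    scontrol update nodename="$host" nodeaddr="$host" nodehostname="$host"
done

If that alone doesn't keep the ping agent away, adding

scontrol update nodename="$host" state=power_down

to the loop might also be worth trying (assuming our 17.11 build accepts the POWER_DOWN state), and something like

scontrol show node slurm9-compute6 | grep -E "NodeAddr|NodeHostName|State"

should show whether the address reset actually sticks.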