That 'not responding' is the issue and usually means one of two things:
1) slurmd is not running on the node
2) something on the network is blocking communication between the node and the master (firewall, selinux, congestion, bad nic, routes, etc.)
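A couple of quick checks can narrow down which of the two it is (a rough sketch; adjust the node name, and note your slurm.conf below sets SlurmdPort=6818):

  # on the node itself: is slurmd actually running?
  systemctl status slurmd

  # from the controller: does the name resolve, and is the slurmd port reachable?
  # (nc is just one option, if it is installed; any TCP check will do)
  getent hosts slurm4-compute9
  nc -zv slurm4-compute9 6818

  # and check what reason slurmctld recorded
  sinfo -R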
Brian Andrus
On 7/30/2021 3:51 PM, Soichi Hayashi wrote:
Brian,
Thank you for your reply, and thanks for setting the email title. I forgot to edit it before I sent it!
I am not sure how to reply to your reply.. but I hope this makes it to the right place..
I've updated slurm.conf to increase the controller debug level
> SlurmctldDebug=5
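(As an aside, the same change can usually be made at runtime, without restarting slurmctld; this assumes a fairly standard scontrol:

  # raise slurmctld logging to debug on the fly
  scontrol setdebug debug
  # or re-read slurm.conf after editing it
  scontrol reconfigure
)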
I now see additional log output (debug).
[2021-07-30T22:42:05.255] debug: Spawning ping agent for slurm4-compute[2-6,10,12-14]
[2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not responding, setting DOWN
It's still very sparse, but it looks like slurm is trying to ping nodes that have already been removed (they don't exist anymore, as they are removed by the slurm_suspend.sh script).
I tried sinfo -R but it doesn't really give much info..
$ sinfo -R
REASON          USER   TIMESTAMP            NODELIST
Not responding  slurm  2021-07-30T22:42:05  slurm4-compute[9,15,19-22,30]
These machines are gone, so they should not respond.
$ ping slurm4-compute9
ping: slurm4-compute9: Name or service not known
This is expected.
Why does slurm keep trying to contact nodes that have already been removed?
slurm_suspend.sh does the following to "remove" a node from the partition:
> scontrol update nodename=${host} nodeaddr="(null)"
Maybe this isn't the correct way to do it? Is there a way to force slurm to forget about the node? I tried "scontrol update node=$node state=idle", but this only works for a few minutes until slurm's ping agent kicks in and marks them down again.
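For reference, the suspend side of this kind of setup typically looks something like the sketch below (hypothetical; the actual slurm_suspend.sh is not shown in this thread). Slurm passes the nodes to suspend as a single hostlist expression in $1:

  #!/bin/bash
  # hypothetical sketch: $1 is the hostlist slurmctld passes in, e.g. "slurm4-compute[9,15]"
  for host in $(scontrol show hostnames "$1"); do
      # ...destroy the cloud instance for ${host} here (provider-specific, not shown)...
      scontrol update nodename="${host}" nodeaddr="(null)"   # the step quoted above
  done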
Thanks!!
Soichi
On Fri, Jul 30, 2021 at 2:21 PM Soichi Hayashi <hayas...@iu.edu> wrote:
Hello. I need help with troubleshooting our slurm cluster.
I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud infrastructure (Jetstream) using the elastic computing mechanism (https://slurm.schedmd.com/elastic_computing.html). Our cluster works for the most part, but for some reason a few of our nodes constantly go into "down" state.
PARTITION AVAIL  TIMELIMIT   JOB_SIZE    ROOT  OVERSUBS  GROUPS  NODES  STATE  NODELIST
cloud*    up     2-00:00:00  1-infinite  no    YES:4     all     10     idle~  slurm9-compute[1-5,10,12-15]
cloud*    up     2-00:00:00  1-infinite  no    YES:4     all     5      down   slurm9-compute[6-9,11]
The only log entries I see in the slurm log are these..
[2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
[2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
[2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
[2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
[2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
..
[2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN
With elastic computing, any unused nodes are automatically removed (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are *expected* to not respond once they are removed, but they should not be marked as DOWN. They should simply be set to "idle".
To work around this issue, I am running the following cron job.
0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume
This "works" somewhat.. but our nodes go to "DOWN" state so often
that running this every hour is not enough.
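A slightly more targeted variant of the same workaround (just a sketch, not a fix for the underlying issue) would be to resume only the nodes that sinfo currently reports as down:

  # resume only the nodes currently reported down in the cloud partition
  for n in $(sinfo -h -N -p cloud -t down -o '%N'); do
      scontrol update nodename="$n" state=resume
  done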
Here is the full content of our slurm.conf
root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
ClusterName=slurm9
ControlMachine=slurm9
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=1
Prolog=/usr/local/sbin/slurm_prolog.sh
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
#make slurm a little more tolerant here
MessageTimeout=30
TCPTimeout=15
BatchStartTimeout=20
GetEnvTimeout=20
InactiveLimit=0
MinJobAge=604800
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#FastSchedule=0
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
#CLOUD CONFIGURATION
PrivateData=cloud
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeRate=1 #number of nodes per minute that can be created; 0 means no limit
ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use
SuspendRate=1 #number of nodes per minute that can be suspended/destroyed
SuspendTime=600 #time in seconds before an idle node is suspended
SuspendTimeout=300 #time between running SuspendProgram and the node being completely down
TreeWidth=30
NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES
I appreciate your assistance!
Soichi Hayashi
Indiana University