Hi all, I'm a new Slurm administrator and I've just gotten my new cluster up and running. We've started getting a lot of "Socket timed out on send/recv operation" errors when submitting jobs, and also when running "squeue" while others are submitting jobs. Jobs do eventually run after about a minute, but the whole system feels very sluggish, and this obviously isn't normal. I'm not sure what's going on here...
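In case it helps, this is roughly how I've been measuring the slowness from a login node: just timing the client commands and dumping the controller's RPC/scheduler statistics with sdiag (the exact fields sdiag prints will vary with the Slurm version, so treat the snippet below as a sketch rather than anything definitive):

    # time a couple of client calls while other users are submitting jobs
    time squeue
    time sbatch --wrap="hostname"

    # controller-side RPC and scheduler statistics
    sdiag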
Head nodes ma-vm-slurm01 and ma-vm-slurm02 are virtual machines running on a Hyper-V host, with a common NFS share mounted on all of the worker and head nodes. The head nodes have 8 CPUs and 8 GB of RAM each and run Ubuntu 16.04. All network interconnect is 10 GbE.

slurmctld log snippet:

Oct 15 11:08:57 ma-vm-slurm01 slurmctld[1603]: validate_node_specs: Node ma-pm-hpc03 unexpectedly rebooted boot_time=1539616111 last response=1539615694
Oct 15 12:40:21 ma-vm-slurm01 slurmctld[1603]: _slurm_rpc_submit_batch_job: JobId=476 InitPrio=4294901364 usec=33061624
Oct 15 12:40:36 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=476 Name=StochUnifOpt_Chu_3D_forCluster.m.job Began, Queued time 00:00:48
Oct 15 12:40:36 ma-vm-slurm01 slurmctld[1603]: sched: Allocate JobId=476 NodeList=ma-pm-hpc17 #CPUs=8 Partition=math-hpc
Oct 15 12:40:45 ma-vm-slurm01 slurmctld[16956]: error: Failed to exec /usr/bin/sendmail: No such file or directory
Oct 15 13:12:00 ma-vm-slurm01 slurmctld[1603]: _slurm_rpc_submit_batch_job: JobId=477 InitPrio=4294901363 usec=75836582
Oct 15 13:12:23 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=477 Name=StochUnifOpt_Chu_3D_forCluster.m.job Began, Queued time 00:01:39
Oct 15 13:12:23 ma-vm-slurm01 slurmctld[1603]: sched: Allocate JobId=477 NodeList=ma-pm-hpc17 #CPUs=8 Partition=math-hpc
Oct 15 13:12:34 ma-vm-slurm01 slurmctld[18952]: error: Failed to exec /usr/bin/sendmail: No such file or directory
Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: sched: _slurm_rpc_allocate_resources JobId=478 NodeList=ma-pm-hpc17 usec=12600182
Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: error: Job allocate response msg send failure, killing JobId=478
Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=478 WTERMSIG 15
Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=478 done
Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=476 WEXITSTATUS 0
Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=476 Name=StochUnifOpt_Chu_3D_forCluster.m.job Ended, Run time 00:56:22, COMPLETED, ExitCode 0
Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=476 done
Oct 15 13:37:03 ma-vm-slurm01 slurmctld[19285]: error: Failed to exec /usr/bin/sendmail: No such file or directory
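One thing I keep coming back to is that the submit RPCs in the log above are taking 33-75 seconds (usec=33061624 and usec=75836582), and StateSaveLocation points at the NFS share (/mnt/HpcStor, see the slurm.conf below). My plan is to run a rough synchronous-write test against that directory and compare it with local disk; the ddtest.tmp file name below is just a scratch file I made up for the test:

    # small synchronous writes into the NFS-backed state save directory
    time dd if=/dev/zero of=/mnt/HpcStor/etc/slurm/state/ddtest.tmp bs=4k count=1000 oflag=sync
    rm -f /mnt/HpcStor/etc/slurm/state/ddtest.tmp

    # same test against local disk for comparison
    time dd if=/dev/zero of=/tmp/ddtest.tmp bs=4k count=1000 oflag=sync
    rm -f /tmp/ddtest.tmp

Separately, I realize the "Failed to exec /usr/bin/sendmail" errors just mean sendmail isn't installed on the controller; I'm assuming that's unrelated to the timeouts and only needs a mail transport installed (or MailProg pointed at something that exists).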
slurm.conf:

# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=math-hpc
ControlMachine=ma-vm-slurm01
#ControlAddr=
BackupController=ma-vm-slurm02
#BackupAddr=
#
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
MailProg=/usr/bin/sendmail
StateSaveLocation=/mnt/HpcStor/etc/slurm/state
SlurmdSpoolDir=/var/spool/slurmd.spool
SwitchType=switch/none
MpiDefault=none
MpiParams=ports=12000-12999
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=/etc/slurm/slurm.epilog.clean
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
TaskPluginParam=Cores,Verbose
#TrackWCKey=no
#TreeWidth=50
TmpFS=/tmp
#UsePAM=
#
# TIMERS
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=600
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/builtin
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_MEMORY
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=5
#SlurmctldLogFile=
SlurmdDebug=5
#SlurmdLogFile=
JobCompType=jobcomp/SlurmDBD
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ma-vm-slurm01
AccountingStorageBackupHost=ma-vm-slurm02
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
AccountingStorageEnforce=associations,limits
#
# COMPUTE NODES
#GresTypes=gpu
NodeName=ma-pm-hpc[01-10] RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
NodeName=ma-pm-hpc[11,12] RealMemory=128000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
NodeName=ma-pm-hpc[13-23] RealMemory=192000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN
PartitionName=math-hpc Nodes=ma-pm-hpc[01-23] Default=YES MaxTime=10-00:00:00 State=UP Shared=FORCE DefMemPerCPU=7680

Thanks,

Kirk J. Main
Systems Administrator, Department of Mathematics
College of Sciences
P: 919.515.6315
kjm...@ncsu.edu