Kirk,

> MailProg=/usr/bin/sendmail

MailProg should be the program used to send mail, i.e. /bin/mail, not sendmail. If I'm not wrong, in the jargon MailProg is an MUA, not an MTA (sendmail is an MTA).
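A minimal sketch of the fix, assuming a mail(1) binary is present on the controllers at /usr/bin/mail (on Ubuntu 16.04 it typically comes from the bsd-mailx or mailutils package; adjust the path to whatever is actually installed on ma-vm-slurm01/02):

    # on the controller nodes, check which mail program actually exists
    $ command -v mail
    /usr/bin/mail

    # slurm.conf: point MailProg at the MUA instead of the MTA
    MailProg=/usr/bin/mail

    # restart slurmctld so the new setting is picked up
    $ sudo systemctl restart slurmctld

With MailProg pointing at a binary that exists, the "Failed to exec /usr/bin/sendmail: No such file or directory" errors in your slurmctld log should stop.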
On Thu, 18 Oct 2018 at 19:01, Kirk Main <kjm...@ncsu.edu> wrote:
> Hi all,
>
> I'm a new administrator to Slurm and I've just got my new cluster up and running. We started getting a lot of "Socket timed out on send/recv operation" errors when submitting jobs, and also if you try to "squeue" while others are submitting jobs. The job does eventually run after about a minute, but the entire system feels very sluggish and obviously this isn't normal. Not sure what's going on here...
>
> Head nodes ma-vm-slurm01 and ma-vm-slurm02 are virtual machines running on a Hyper-V host with a common NFS share between all of the worker and head nodes. Head nodes have 8 CPU/8 GB running on Ubuntu 16.04. All network interconnect is 10GbE.
>
> slurmctld log snippet:
>
> Oct 15 11:08:57 ma-vm-slurm01 slurmctld[1603]: validate_node_specs: Node ma-pm-hpc03 unexpectedly rebooted boot_time=1539616111 last response=1539615694
> Oct 15 12:40:21 ma-vm-slurm01 slurmctld[1603]: _slurm_rpc_submit_batch_job: JobId=476 InitPrio=4294901364 usec=33061624
> Oct 15 12:40:36 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=476 Name=StochUnifOpt_Chu_3D_forCluster.m.job Began, Queued time 00:00:48
> Oct 15 12:40:36 ma-vm-slurm01 slurmctld[1603]: sched: Allocate JobId=476 NodeList=ma-pm-hpc17 #CPUs=8 Partition=math-hpc
> Oct 15 12:40:45 ma-vm-slurm01 slurmctld[16956]: error: Failed to exec /usr/bin/sendmail: No such file or directory
> Oct 15 13:12:00 ma-vm-slurm01 slurmctld[1603]: _slurm_rpc_submit_batch_job: JobId=477 InitPrio=4294901363 usec=75836582
> Oct 15 13:12:23 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=477 Name=StochUnifOpt_Chu_3D_forCluster.m.job Began, Queued time 00:01:39
> Oct 15 13:12:23 ma-vm-slurm01 slurmctld[1603]: sched: Allocate JobId=477 NodeList=ma-pm-hpc17 #CPUs=8 Partition=math-hpc
> Oct 15 13:12:34 ma-vm-slurm01 slurmctld[18952]: error: Failed to exec /usr/bin/sendmail: No such file or directory
> Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: sched: _slurm_rpc_allocate_resources JobId=478 NodeList=ma-pm-hpc17 usec=12600182
> Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: error: Job allocate response msg send failure, killing JobId=478
> Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=478 WTERMSIG 15
> Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=478 done
> Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=476 WEXITSTATUS 0
> Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=476 Name=StochUnifOpt_Chu_3D_forCluster.m.job Ended, Run time 00:56:22, COMPLETED, ExitCode 0
> Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=476 done
> Oct 15 13:37:03 ma-vm-slurm01 slurmctld[19285]: error: Failed to exec /usr/bin/sendmail: No such file or directory
>
> slurm.conf:
>
> # slurm.conf file generated by configurator.html.
> #
> # See the slurm.conf man page for more information.
> #
> ClusterName=math-hpc
> ControlMachine=ma-vm-slurm01
> #ControlAddr=
> BackupController=ma-vm-slurm02
> #BackupAddr=
> #
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> MailProg=/usr/bin/sendmail
> StateSaveLocation=/mnt/HpcStor/etc/slurm/state
> SlurmdSpoolDir=/var/spool/slurmd.spool
> SwitchType=switch/none
> MpiDefault=none
> MpiParams=ports=12000-12999
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/linuxproc
> #PluginDir=
> CacheGroups=0
> #FirstJobId=
> ReturnToService=1
> #MaxJobCount=
> #PlugStackConfig=
> #PropagatePrioProcess=
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> #Prolog=
> #Epilog=/etc/slurm/slurm.epilog.clean
> #SrunProlog=
> #SrunEpilog=
> #TaskProlog=
> #TaskEpilog=
> TaskPlugin=task/cgroup
> TaskPluginParam=Cores,Verbose
> #TrackWCKey=no
> #TreeWidth=50
> TmpFS=/tmp
> #UsePAM=
> #
> # TIMERS
> SlurmctldTimeout=120
> SlurmdTimeout=300
> InactiveLimit=600
> MinJobAge=300
> KillWait=30
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/builtin
> #SchedulerAuth=
> #SchedulerPort=
> #SchedulerRootFilter=
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_MEMORY
> FastSchedule=1
> #PriorityType=priority/multifactor
> #PriorityDecayHalfLife=14-0
> #PriorityUsageResetPeriod=14-0
> #PriorityWeightFairshare=100000
> #PriorityWeightAge=1000
> #PriorityWeightPartition=10000
> #PriorityWeightJobSize=1000
> #PriorityMaxAge=1-0
> #
> # LOGGING
> SlurmctldDebug=5
> #SlurmctldLogFile=
> SlurmdDebug=5
> #SlurmdLogFile=
> JobCompType=jobcomp/SlurmDBD
> #JobCompLoc=
> #
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> #
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=ma-vm-slurm01
> AccountingStorageBackupHost=ma-vm-slurm02
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStorageUser=
> AccountingStorageEnforce=associations,limits
> #
> # COMPUTE NODES
> #GresTypes=gpu
> NodeName=ma-pm-hpc[01-10] RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
> NodeName=ma-pm-hpc[11,12] RealMemory=128000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
> NodeName=ma-pm-hpc[13-23] RealMemory=192000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN
> PartitionName=math-hpc Nodes=ma-pm-hpc[01-23] Default=YES MaxTime=10-00:00:00 State=UP Shared=FORCE DefMemPerCPU=7680
>
> Thanks,
> Kirk J. Main
> Systems Administrator, Department of Mathematics
> College of Sciences
> P: 919.515.6315
> kjm...@ncsu.edu