Kirk,

> MailProg=/usr/bin/sendmail

MailProg should be the program used to send mail, i.e. /bin/mail, not sendmail. If I'm not wrong, in the jargon MailProg is an MUA, not an MTA (sendmail is an MTA).
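A minimal sketch of the fix, assuming a mail(1) binary is present on the controllers at /usr/bin/mail (on Ubuntu 16.04 it typically comes from the bsd-mailx or mailutils package; adjust the path to whatever is actually installed on ma-vm-slurm01/02):

    # on the controller nodes, check which mail program actually exists
    $ command -v mail
    /usr/bin/mail

    # slurm.conf: point MailProg at the MUA instead of the MTA
    MailProg=/usr/bin/mail

    # restart slurmctld so the new setting is picked up
    $ sudo systemctl restart slurmctld

With MailProg pointing at a binary that exists, the "Failed to exec /usr/bin/sendmail: No such file or directory" errors in your slurmctld log should stop.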
On Thu, 18 Oct 2018 at 19:01, Kirk Main <kjm...@ncsu.edu> wrote:
> Hi all,
>
> I'm a new administrator to Slurm and I've just got my new cluster up and running. We started getting a lot of "Socket timed out on send/recv operation" errors when submitting jobs, and also if you try to "squeue" while others are submitting jobs. The job does eventually run after about a minute, but the entire system feels very sluggish and obviously this isn't normal. Not sure what's going on here...
>
> Head nodes ma-vm-slurm01 and ma-vm-slurm02 are virtual machines running on a Hyper-V host with a common NFS share between all of the worker and head nodes. Head nodes have 8 CPU/8 GB running on Ubuntu 16.04. All network interconnect is 10GbE.
>
> slurmctld log snippet:
>
> Oct 15 11:08:57 ma-vm-slurm01 slurmctld[1603]: validate_node_specs: Node ma-pm-hpc03 unexpectedly rebooted boot_time=1539616111 last response=1539615694
> Oct 15 12:40:21 ma-vm-slurm01 slurmctld[1603]: _slurm_rpc_submit_batch_job: JobId=476 InitPrio=4294901364 usec=33061624
> Oct 15 12:40:36 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=476 Name=StochUnifOpt_Chu_3D_forCluster.m.job Began, Queued time 00:00:48
> Oct 15 12:40:36 ma-vm-slurm01 slurmctld[1603]: sched: Allocate JobId=476 NodeList=ma-pm-hpc17 #CPUs=8 Partition=math-hpc
> Oct 15 12:40:45 ma-vm-slurm01 slurmctld[16956]: error: Failed to exec /usr/bin/sendmail: No such file or directory
> Oct 15 13:12:00 ma-vm-slurm01 slurmctld[1603]: _slurm_rpc_submit_batch_job: JobId=477 InitPrio=4294901363 usec=75836582
> Oct 15 13:12:23 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=477 Name=StochUnifOpt_Chu_3D_forCluster.m.job Began, Queued time 00:01:39
> Oct 15 13:12:23 ma-vm-slurm01 slurmctld[1603]: sched: Allocate JobId=477 NodeList=ma-pm-hpc17 #CPUs=8 Partition=math-hpc
> Oct 15 13:12:34 ma-vm-slurm01 slurmctld[18952]: error: Failed to exec /usr/bin/sendmail: No such file or directory
> Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: sched: _slurm_rpc_allocate_resources JobId=478 NodeList=ma-pm-hpc17 usec=12600182
> Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: error: Job allocate response msg send failure, killing JobId=478
> Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=478 WTERMSIG 15
> Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=478 done
> Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=476 WEXITSTATUS 0
> Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: email msg to mrga...@ncsu.edu: Slurm Job_id=476 Name=StochUnifOpt_Chu_3D_forCluster.m.job Ended, Run time 00:56:22, COMPLETED, ExitCode 0
> Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=476 done
> Oct 15 13:37:03 ma-vm-slurm01 slurmctld[19285]: error: Failed to exec /usr/bin/sendmail: No such file or directory
>
> slurm.conf:
>
> # slurm.conf file generated by configurator.html.
> #
> # See the slurm.conf man page for more information.
> #
> ClusterName=math-hpc
> ControlMachine=ma-vm-slurm01
> #ControlAddr=
> BackupController=ma-vm-slurm02
> #BackupAddr=
> #
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> MailProg=/usr/bin/sendmail
> StateSaveLocation=/mnt/HpcStor/etc/slurm/state
> SlurmdSpoolDir=/var/spool/slurmd.spool
> SwitchType=switch/none
> MpiDefault=none
> MpiParams=ports=12000-12999
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/linuxproc
> #PluginDir=
> CacheGroups=0
> #FirstJobId=
> ReturnToService=1
> #MaxJobCount=
> #PlugStackConfig=
> #PropagatePrioProcess=
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> #Prolog=
> #Epilog=/etc/slurm/slurm.epilog.clean
> #SrunProlog=
> #SrunEpilog=
> #TaskProlog=
> #TaskEpilog=
> TaskPlugin=task/cgroup
> TaskPluginParam=Cores,Verbose
> #TrackWCKey=no
> #TreeWidth=50
> TmpFS=/tmp
> #UsePAM=
> #
> # TIMERS
> SlurmctldTimeout=120
> SlurmdTimeout=300
> InactiveLimit=600
> MinJobAge=300
> KillWait=30
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/builtin
> #SchedulerAuth=
> #SchedulerPort=
> #SchedulerRootFilter=
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_MEMORY
> FastSchedule=1
> #PriorityType=priority/multifactor
> #PriorityDecayHalfLife=14-0
> #PriorityUsageResetPeriod=14-0
> #PriorityWeightFairshare=100000
> #PriorityWeightAge=1000
> #PriorityWeightPartition=10000
> #PriorityWeightJobSize=1000
> #PriorityMaxAge=1-0
> #
> # LOGGING
> SlurmctldDebug=5
> #SlurmctldLogFile=
> SlurmdDebug=5
> #SlurmdLogFile=
> JobCompType=jobcomp/SlurmDBD
> #JobCompLoc=
> #
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> #
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=ma-vm-slurm01
> AccountingStorageBackupHost=ma-vm-slurm02
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStorageUser=
> AccountingStorageEnforce=associations,limits
> #
> # COMPUTE NODES
> #GresTypes=gpu
> NodeName=ma-pm-hpc[01-10] RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
> NodeName=ma-pm-hpc[11,12] RealMemory=128000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
> NodeName=ma-pm-hpc[13-23] RealMemory=192000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN
> PartitionName=math-hpc Nodes=ma-pm-hpc[01-23] Default=YES MaxTime=10-00:00:00 State=UP Shared=FORCE DefMemPerCPU=7680
>
> Thanks,
> Kirk J. Main
> Systems Administrator, Department of Mathematics
> College of Sciences
> P: 919.515.6315
> kjm...@ncsu.edu