Thanks Brian, indeed we did have it set in bytes. I've set it to the MB value and hope that takes care of the situation.
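For the record, the relevant piece of our slurm.conf now looks roughly like the line below. This is a sketch rather than a verbatim copy (Bright Cluster generates the file for us, so the exact shape of the node definition may differ); the Sockets/CoresPerSocket/ThreadsPerCore values come from the lscpu output quoted further down, and 191879 is the lower of the two figures slurmd reported for our nodes (191879 and 191883), since RealMemory should not exceed what a node actually detects:

    NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191879 Gres=gpu:1

As I understand it, slurmctld and the slurmd daemons need to re-read slurm.conf (scontrol reconfigure or a restart) before the new value takes effect. I've also added a short verification/undrain note below the quoted thread.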
> On Jul 8, 2019, at 4:02 PM, Brian Andrus <toomuc...@gmail.com> wrote:
>
> Your problem here is that the configuration for the nodes in question has an incorrect amount of memory set for them. It looks like you have it set in bytes instead of megabytes.
>
> In your slurm.conf you should look at the RealMemory setting:
>
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The default value is 1.
>
> I would suggest RealMemory=191879, where I suspect you have RealMemory=196489092
>
> Brian Andrus
>
> On 7/8/2019 11:59 AM, Robert Kudyba wrote:
>> I'm new to Slurm and we have a 3-node + head node cluster running CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they say Slurm is configured optimally to allow multiple tasks to run. However, at times a job will hold up new jobs. Are there any other logs I can look at and/or settings to change to prevent this or alert me when this is happening? Here are some tests and commands that I hope will illuminate where I may be going wrong. The slurm.conf file has these options set:
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>> SchedulerTimeSlice=60
>>
>> I also see /var/log/slurmctld is loaded with errors like these:
>> [2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: Invalid argument
>> [2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 < 196489092)
>> [2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: Invalid argument
>> [2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 < 196489092)
>>
>> squeue
>> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>> 352 defq TensorFl myuser PD 0:00 3 (Resources)
>>
>> scontrol show jobid -dd 352
>> JobId=352 JobName=TensorFlowGPUTest
>> UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
>> Priority=4294901741 Nice=0 Account=(null) QOS=normal
>> JobState=PENDING Reason=Resources Dependency=(null)
>> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>> DerivedExitCode=0:0
>> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>> SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
>> StartTime=Unknown EndTime=Unknown Deadline=N/A
>> PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> LastSchedEval=2019-07-02T16:57:59
>> Partition=defq AllocNode:Sid=ourcluster:386851
>> ReqNodeList=(null) ExcNodeList=(null)
>> NodeList=(null)
>> NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> TRES=cpu=3,node=3
>> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>> Features=(null) DelayBoot=00:00:00
>> Gres=gpu:1 Reservation=(null)
>> OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>> Command=/home/myuser/cnn_gpu.sh
>> WorkDir=/home/myuser
>> StdErr=/home/myuser/slurm-352.out
>> StdIn=/dev/null
>> StdOut=/home/myuser/slurm-352.out
>> Power=
>>
>> Another test showed the below:
>> sinfo -N
>> NODELIST NODES PARTITION STATE
>> node001 1 defq* drain
>> node002 1 defq* drain
>> node003 1 defq* drain
>>
>> sinfo -R
>> REASON USER TIMESTAMP NODELIST
>> Low RealMemory slurm 2019-05-17T10:05:26 node[001-003]
>>
>> [ciscluster]% jobqueue
>> [ciscluster->jobqueue(slurm)]% ls
>> Type Name Nodes
>> ------------ ------------------------ ----------------------------------------------------
>> Slurm defq node001..node003
>> Slurm gpuq
>> [ourcluster->jobqueue(slurm)]% use defq
>> [ourcluster->jobqueue(slurm)->defq]% get options
>> QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
>>
>> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
>> node003: Thread(s) per core: 1
>> node003: Core(s) per socket: 12
>> node003: Socket(s): 2
>> node001: Thread(s) per core: 1
>> node001: Core(s) per socket: 12
>> node001: Socket(s): 2
>> node002: Thread(s) per core: 1
>> node002: Core(s) per socket: 12
>> node002: Socket(s): 2
>>
>> scontrol show nodes node001
>> NodeName=node001 Arch=x86_64 CoresPerSocket=12
>> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=gpu:1
>> NodeAddr=node001 NodeHostName=node001 Version=17.11
>> OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
>> RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
>> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=defq
>> BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>> Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
>>
>> sinfo
>> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>> defq* up infinite 3 drain node[001-003]
>> gpuq up infinite 0 n/a
>>
>> scontrol show nodes | grep -i mem
>> RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
>> RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
>> RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
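
P.S. For anyone who finds this thread later, a few commands that should help sanity-check the fix. Written from memory as a sketch, so please double-check the man pages:

    slurmd -C      # run on a compute node; prints the hardware slurmd detects, with RealMemory in MB
    free -m        # OS view of total memory in MB, for cross-checking the slurm.conf value

    # Nodes drained with "Low RealMemory" generally stay drained even after the
    # config is corrected, until they are explicitly resumed:
    scontrol update NodeName=node[001-003] State=RESUME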