Your problem here is that the configuration for the nodes in question
has an incorrect amount of memory set for them. It looks like you have it
set in kilobytes instead of megabytes: 196489092 kB / 1024 ≈ 191884 MB,
which matches the 191879-191883 MB the nodes actually report.
In your slurm.conf you should look at the RealMemory setting:
*RealMemory*
Size of real memory on the node in megabytes (e.g. "2048"). The
default value is 1.
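If you want to double-check the right number for each node, slurmd can
print what it actually detects there; the RealMemory it reports (in
megabytes) is the most you should configure. A rough sketch of what that
looks like (the output below is illustrative, built from the figures in
your logs, not captured from your nodes):

slurmd -C
NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191883
UpTime=...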
I would suggest RealMemory=191879, where I suspect you have
RealMemory=196489092
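A minimal sketch of the corrected node definition, assuming your three
nodes share one NodeName line and have the layout shown in your scontrol
output (adjust to however your slurm.conf actually defines them; if
Bright generates slurm.conf for you, make the change through its tooling
so it is not overwritten):

NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191879 Gres=gpu:1

After the change, re-read the config and clear the drain state the old
value caused:

scontrol reconfigure
scontrol update NodeName=node[001-003] State=RESUME

(Restarting slurmctld and the slurmd daemons also works if reconfigure
does not pick up the new value.)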
Brian Andrus
On 7/8/2019 11:59 AM, Robert Kudyba wrote:
I'm new to Slurm and we have a 3-node + head node cluster running
CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they
say Slurm is configured optimally to allow multiple tasks to run.
However, at times a job will hold up new jobs. Are there any other logs
I can look at and/or settings to change to prevent this or alert me
when it is happening? Here are some tests and commands that I hope
will illuminate where I may be going wrong. The slurm.conf file has
these options set:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
I also see /var/log/slurmctld is loaded with errors like these:
[2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration
node=node003: Invalid argument
[2019-07-03T02:54:50.655] error: Node node002 has low real_memory size
(191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration
node=node002: Invalid argument
[2019-07-03T02:54:50.655] error: Node node001 has low real_memory size
(191883 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration
node=node001: Invalid argument
[2019-07-03T02:54:50.655] error: Node node003 has low real_memory size
(191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration
node=node003: Invalid argument
[2019-07-03T03:28:10.293] error: Node node002 has low real_memory size
(191879 < 196489092)
[2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration
node=node002: Invalid argument
[2019-07-03T03:28:10.293] error: Node node003 has low real_memory size
(191879 < 196489092)
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
352 defq TensorFl myuser PD 0:00 3 (Resources)
scontrol show jobid -dd 352
JobId=352 JobName=TensorFlowGPUTest
UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
Priority=4294901741 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-07-02T16:57:59
Partition=defq AllocNode:Sid=ourcluster:386851
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/home/myuser/cnn_gpu.sh
WorkDir=/home/myuser
StdErr=/home/myuser/slurm-352.out
StdIn=/dev/null
StdOut=/home/myuser/slurm-352.out
Power=
Another test showed the following:
sinfo -N
NODELIST NODES PARTITION STATE
node001 1 defq* drain
node002 1 defq* drain
node003 1 defq* drain
sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory slurm 2019-05-17T10:05:26 node[001-003]
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type Name Nodes
------------ ------------------------ ----------------------------------------------------
Slurm defq node001..node003
Slurm gpuq
[ourcluster->jobqueue(slurm)]% use defq
[ourcluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node003: Thread(s) per core: 1
node003: Core(s) per socket: 12
node003: Socket(s): 2
node001: Thread(s) per core: 1
node001: Core(s) per socket: 12
node001: Socket(s): 2
node002: Thread(s) per core: 1
node002: Core(s) per socket: 12
node002: Socket(s): 2
scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=node001 NodeHostName=node001 Version=17.11
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=defq
BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
CfgTRES=cpu=24,mem=196489092M,billing=24
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 drain node[001-003]
gpuq up infinite 0 n/a
scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]