Your problem here is that the configuration for the nodes in question
has an incorrect amount of memory set for them. It looks like you have it
set in kilobytes instead of megabytes: 196489092 kB / 1024 ≈ 191884 MB,
which matches the 191879-191883 MB the nodes actually report.
In your slurm.conf you should look at the RealMemory setting:
*RealMemory*
Size of real memory on the node in megabytes (e.g. "2048"). The
default value is 1.
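If you want to double-check the right number for each node, slurmd can
print what it actually detects there; the RealMemory it reports (in
megabytes) is the most you should configure. A rough sketch of what that
looks like (the output below is illustrative, built from the figures in
your logs, not captured from your nodes):

slurmd -C
NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191883
UpTime=...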
I would suggest RealMemory=191879, where I suspect you have
RealMemory=196489092
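A minimal sketch of the corrected node definition, assuming your three
nodes share one NodeName line and have the layout shown in your scontrol
output (adjust to however your slurm.conf actually defines them; if
Bright generates slurm.conf for you, make the change through its tooling
so it is not overwritten):

NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191879 Gres=gpu:1

After the change, re-read the config and clear the drain state the old
value caused:

scontrol reconfigure
scontrol update NodeName=node[001-003] State=RESUME

(Restarting slurmctld and the slurmd daemons also works if reconfigure
does not pick up the new value.)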
Brian Andrus
On 7/8/2019 11:59 AM, Robert Kudyba wrote:
I'm new to Slurm and we have a 3-node + head node cluster running
CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they
say Slurm is configured optimally to allow multiple tasks to run.
However, at times a job will hold up new jobs. Are there any other logs
I can look at and/or settings to change to prevent this or alert me
when it is happening? Here are some tests and commands that I hope
will illuminate where I may be going wrong. The slurm.conf file has
these options set:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
I also see /var/log/slurmctld is loaded with errors like these:
[2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration
node=node003: Invalid argument
[2019-07-03T02:54:50.655] error: Node node002 has low real_memory size
(191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration
node=node002: Invalid argument
[2019-07-03T02:54:50.655] error: Node node001 has low real_memory size
(191883 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration
node=node001: Invalid argument
[2019-07-03T02:54:50.655] error: Node node003 has low real_memory size
(191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration
node=node003: Invalid argument
[2019-07-03T03:28:10.293] error: Node node002 has low real_memory size
(191879 < 196489092)
[2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration
node=node002: Invalid argument
[2019-07-03T03:28:10.293] error: Node node003 has low real_memory size
(191879 < 196489092)
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
352 defq TensorFl myuser PD 0:00 3 (Resources)
scontrol show jobid -dd 352
JobId=352 JobName=TensorFlowGPUTest
UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
Priority=4294901741 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-07-02T16:57:59
Partition=defq AllocNode:Sid=ourcluster:386851
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/home/myuser/cnn_gpu.sh
WorkDir=/home/myuser
StdErr=/home/myuser/slurm-352.out
StdIn=/dev/null
StdOut=/home/myuser/slurm-352.out
Power=
Another test showed the following:
sinfo -N
NODELIST NODES PARTITION STATE
node001 1 defq* drain
node002 1 defq* drain
node003 1 defq* drain
sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory slurm 2019-05-17T10:05:26 node[001-003]
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type Name Nodes
------------ ------------------------ ----------------------------------------------------
Slurm defq node001..node003
Slurm gpuq
[ourcluster->jobqueue(slurm)]% use defq
[ourcluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node003: Thread(s) per core: 1
node003: Core(s) per socket: 12
node003: Socket(s): 2
node001: Thread(s) per core: 1
node001: Core(s) per socket: 12
node001: Socket(s): 2
node002: Thread(s) per core: 1
node002: Core(s) per socket: 12
node002: Socket(s): 2
scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=node001 NodeHostName=node001 Version=17.11
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=defq
BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
CfgTRES=cpu=24,mem=196489092M,billing=24
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 drain node[001-003]
gpuq up infinite 0 n/a
scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]