Your problem here is that the configuration for the nodes in question has an incorrect amount of memory set for them. It looks like you have it set in kilobytes instead of megabytes: 196489092 / 1024 is roughly 191884, which matches what slurmd is reporting for your nodes (191879-191883 MB).

In your slurm.conf you should look at the RealMemory setting:

*RealMemory*
   Size of real memory on the node in megabytes (e.g. "2048"). The
   default value is 1.

I would suggest RealMemory=191879, where I suspect you currently have RealMemory=196489092.
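Roughly, the fix is to correct the node definition in slurm.conf and then un-drain the nodes, along these lines (the NodeName line below is assembled from the values in your output, so adjust it to match whatever else you have defined there; and since Bright manages the cluster, you may need to make the change through its configuration tools so it is not overwritten):

NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191879 Gres=gpu:1

scontrol reconfigure
scontrol update NodeName=node[001-003] State=RESUME

The last command clears the "Low RealMemory" drain reason once the memory the nodes register is no longer below the configured value.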

Brian Andrus

On 7/8/2019 11:59 AM, Robert Kudyba wrote:
I’m new to Slurm and we have a 3 node + head node cluster running CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they say Slurm is configured optimally to allow multiple tasks to run. However, at times a job will hold up new jobs. Are there any other logs I can look at and/or settings to change to prevent this, or to alert me when this is happening? Here are some tests and commands that I hope will illuminate where I may be going wrong. The slurm.conf file has these options set:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60

I also see /var/log/slurmctld is loaded with errors like these:
[2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 < 196489092)

squeue
JOBID PARTITION NAME  USER  ST TIME NODES NODELIST(REASON)
352   defq TensorFl myuser PD 0:00 3 (Resources)

 scontrol show jobid -dd 352
JobId=352 JobName=TensorFlowGPUTest
UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
Priority=4294901741 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-07-02T16:57:59
Partition=defq AllocNode:Sid=ourcluster:386851
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/home/myuser/cnn_gpu.sh
WorkDir=/home/myuser
StdErr=/home/myuser/slurm-352.out
StdIn=/dev/null
StdOut=/home/myuser/slurm-352.out
Power=

Another test showed the below:
sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq*    drain
node002        1     defq*    drain
node003        1     defq*    drain

sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       slurm     2019-05-17T10:05:26 node[001-003]


[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type   Name   Nodes
------ ------ ----------------------------------------
Slurm  defq   node001..node003
Slurm  gpuq
[ourcluster->jobqueue(slurm)]% use defq
[ourcluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node003: Thread(s) per core: 1
node003: Core(s) per socket: 12
node003: Socket(s): 2
node001: Thread(s) per core: 1
node001: Core(s) per socket: 12
node001: Socket(s): 2
node002: Thread(s) per core: 1
node002: Core(s) per socket: 12
node002: Socket(s): 2

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=node001 NodeHostName=node001 Version=17.11
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=defq
BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
CfgTRES=cpu=24,mem=196489092M,billing=24
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]


sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 drain node[001-003]
gpuq up infinite 0 n/a


scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
