Thanks Brian, indeed we did have it set in bytes. I've set it to the MB value and hope that takes care of the situation.
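For the record, the relevant piece of our slurm.conf now looks roughly like the line below. This is a sketch rather than a verbatim copy (Bright Cluster generates the file for us, so the exact shape of the node definition may differ); the Sockets/CoresPerSocket/ThreadsPerCore values come from the lscpu output quoted further down, and 191879 is the lower of the two figures slurmd reported for our nodes (191879 and 191883), since RealMemory should not exceed what a node actually detects:

    NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191879 Gres=gpu:1

As I understand it, slurmctld and the slurmd daemons need to re-read slurm.conf (scontrol reconfigure or a restart) before the new value takes effect. I've also added a short verification/undrain note below the quoted thread.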
> On Jul 8, 2019, at 4:02 PM, Brian Andrus <toomuc...@gmail.com> wrote:
>
> Your problem here is that the configuration for the nodes in question has an incorrect amount of memory set for them. It looks like you have it set in bytes instead of megabytes.
>
> In your slurm.conf you should look at the RealMemory setting:
>
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The default value is 1.
>
> I would suggest RealMemory=191879, where I suspect you have RealMemory=196489092
>
> Brian Andrus
>
> On 7/8/2019 11:59 AM, Robert Kudyba wrote:
>> I'm new to Slurm and we have a 3-node + head node cluster running CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they say Slurm is configured optimally to allow multiple tasks to run. However, at times a job will hold up new jobs. Are there any other logs I can look at and/or settings to change to prevent this or alert me when this is happening? Here are some tests and commands that I hope will illuminate where I may be going wrong. The slurm.conf file has these options set:
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>> SchedulerTimeSlice=60
>>
>> I also see /var/log/slurmctld is loaded with errors like these:
>> [2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: Invalid argument
>> [2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 < 196489092)
>> [2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: Invalid argument
>> [2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 < 196489092)
>>
>> squeue
>> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>> 352 defq TensorFl myuser PD 0:00 3 (Resources)
>>
>> scontrol show jobid -dd 352
>> JobId=352 JobName=TensorFlowGPUTest
>> UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
>> Priority=4294901741 Nice=0 Account=(null) QOS=normal
>> JobState=PENDING Reason=Resources Dependency=(null)
>> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>> DerivedExitCode=0:0
>> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>> SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
>> StartTime=Unknown EndTime=Unknown Deadline=N/A
>> PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> LastSchedEval=2019-07-02T16:57:59
>> Partition=defq AllocNode:Sid=ourcluster:386851
>> ReqNodeList=(null) ExcNodeList=(null)
>> NodeList=(null)
>> NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> TRES=cpu=3,node=3
>> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>> Features=(null) DelayBoot=00:00:00
>> Gres=gpu:1 Reservation=(null)
>> OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>> Command=/home/myuser/cnn_gpu.sh
>> WorkDir=/home/myuser
>> StdErr=/home/myuser/slurm-352.out
>> StdIn=/dev/null
>> StdOut=/home/myuser/slurm-352.out
>> Power=
>>
>> Another test showed the below:
>> sinfo -N
>> NODELIST NODES PARTITION STATE
>> node001 1 defq* drain
>> node002 1 defq* drain
>> node003 1 defq* drain
>>
>> sinfo -R
>> REASON USER TIMESTAMP NODELIST
>> Low RealMemory slurm 2019-05-17T10:05:26 node[001-003]
>>
>> [ciscluster]% jobqueue
>> [ciscluster->jobqueue(slurm)]% ls
>> Type Name Nodes
>> ------------ ------------------------ ----------------------------------------------------
>> Slurm defq node001..node003
>> Slurm gpuq
>> [ourcluster->jobqueue(slurm)]% use defq
>> [ourcluster->jobqueue(slurm)->defq]% get options
>> QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
>>
>> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
>> node003: Thread(s) per core: 1
>> node003: Core(s) per socket: 12
>> node003: Socket(s): 2
>> node001: Thread(s) per core: 1
>> node001: Core(s) per socket: 12
>> node001: Socket(s): 2
>> node002: Thread(s) per core: 1
>> node002: Core(s) per socket: 12
>> node002: Socket(s): 2
>>
>> scontrol show nodes node001
>> NodeName=node001 Arch=x86_64 CoresPerSocket=12
>> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=gpu:1
>> NodeAddr=node001 NodeHostName=node001 Version=17.11
>> OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
>> RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
>> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=defq
>> BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>> Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
>>
>> sinfo
>> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>> defq* up infinite 3 drain node[001-003]
>> gpuq up infinite 0 n/a
>>
>> scontrol show nodes | grep -i mem
>> RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
>> RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
>> RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
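
P.S. For anyone who finds this thread later, a few commands that should help sanity-check the fix. Written from memory as a sketch, so please double-check the man pages:

    slurmd -C      # run on a compute node; prints the hardware slurmd detects, with RealMemory in MB
    free -m        # OS view of total memory in MB, for cross-checking the slurm.conf value

    # Nodes drained with "Low RealMemory" generally stay drained even after the
    # config is corrected, until they are explicitly resumed:
    scontrol update NodeName=node[001-003] State=RESUME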