Hi all - I’m new to Slurm, and in many ways it’s been very nice to work with, 
but I’m having an issue trying to set up thread/core/socket counts on nodes 
correctly.  If I don’t specify anything except CPUs, the node is available but 
doesn’t appear to know about cores or hyperthreading.  If I do try to specify 
that information, Slurm claims the numbers aren’t consistent and sets the node 
to drain.

This is all on CentOS 7 with Slurm 18.08, and FastSchedule is set to 0.

First type of node: 2 x 8-core CPUs, hyperthreading on, nothing specified in 
slurm.conf except CPUs.  /proc/cpuinfo confirms that there are 32 “cpus”, with 
the expected values for physical id and core id.
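
For reference, lscpu (which just summarizes the same /proc/cpuinfo data) 
should show something like this for that topology, trimmed to the relevant 
lines:

$ lscpu | egrep '^(CPU\(s\)|Thread|Core|Socket)'
CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2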

From slurm.conf:
NodeName=compute-2-0 NodeAddr=10.1.255.250 CPUs=32 Weight=20511700 Feature=rack-2,32CPUs

From scontrol show node:
NodeName=compute-2-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=32 CPULoad=0.04
   AvailableFeatures=rack-2,32CPUs
   ActiveFeatures=rack-2,32CPUs
   Gres=(null)
   NodeAddr=10.1.255.250 NodeHostName=compute-2-0 Version=18.08
   OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
   RealMemory=257742 AllocMem=0 FreeMem=255703 Sockets=32 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=913567 Weight=20511700 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,n2013f
   BootTime=2018-10-10T11:06:42 SlurmdStartTime=2018-10-10T11:07:16
   CfgTRES=cpu=32,mem=257742M,billing=94
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
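
As a point of comparison, slurmd -C on the node prints the hardware slurmd 
itself detects, already in slurm.conf format.  Assuming 18.08’s output format, 
I’d expect something along the lines of:

$ slurmd -C
NodeName=compute-2-0 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=257742

which would show whether slurmd sees the real topology even when slurm.conf 
only lists CPUs.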


Second type of node: 2 x 4-core CPUs, hyperthreading on.  /proc/cpuinfo 
confirms that there are 16 “cpus”, with the expected values for physical id and 
core id.

If I set the socket/core/thread counts to what I think is correct (note that 
this is a different type of machine than the previous one):
NodeName=compute-0-0 NodeAddr=10.1.255.253 CPUs=16 Weight=20495900 Feature=rack-0,16CPUs Sockets=2 CoresPerSocket=4 ThreadsPerCore=2
I get the following:
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=16 CPULoad=0.42
   AvailableFeatures=rack-0,16CPUs
   ActiveFeatures=rack-0,16CPUs
   Gres=(null)
   NodeAddr=10.1.255.253 NodeHostName=compute-0-0 Version=18.08
   OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
   RealMemory=11842 AllocMem=0 FreeMem=11335 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=275125 Weight=20495900 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,ib_qdr
   BootTime=2018-10-10T11:06:55 SlurmdStartTime=2018-10-10T11:07:34
   CfgTRES=cpu=16,mem=11842M,billing=18
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low socket*core count [root@2018-10-10T10:26:14]
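
For what it’s worth, the drained state can be cleared with

$ scontrol update NodeName=compute-0-0 State=RESUME

but as I understand it the node will just drain again at the next registration 
if slurmctld still thinks the configuration is inconsistent.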

A couple of things seem suspicious to me, but I’m not sure:
1. I get the impression that Slurm is supposed to be able to figure out the 
node’s architecture automatically (my understanding is that FastSchedule=0 
means the configuration slurmd detects takes precedence over slurm.conf), but 
in the first example there’s no evidence of that in the scontrol output, which 
reports Sockets=32, CoresPerSocket=1, ThreadsPerCore=1.
2. When I do set the architecture-related parameters, Slurm claims the numbers 
don’t match, even though sockets * cores * threads = 2 * 4 * 2 = 16 = CPUs.

Does anyone have any idea as to what’s going on, or what other information 
would be useful for debugging?

thanks,
Noam