Hi all - I’m new to slurm, and in many ways it’s been very nice to work with, but I’m having an issue trying to properly set up thread/core/socket counts on nodes. Basically, if I don’t specify anything except CPUs, the node is available, but doesn’t appear to know about cores and hyperthreading. If I do try to specify that info it claims that the numbers aren’t consistent and sets the node to drain.
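(For what it’s worth, the way I’ve been sanity-checking the hardware itself is with lscpu; on the 2 x 8 core nodes I’d expect something along these lines, quoting from memory rather than pasting actual output:

   CPU(s):                32
   Thread(s) per core:    2
   Core(s) per socket:    8
   Socket(s):             2

so the hardware side looks right to me.)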
This is all on CentOS 7, slurm 18.08, and FastSchedule is set to 0.

First type of node: 2 x 8 core CPUs, hyperthreading on, nothing specified in slurm.conf except CPUs. /proc/cpuinfo confirms that there are 32 “cpus”, with the expected values for physical id and core id.

From slurm.conf:

   NodeName=compute-2-0 NodeAddr=10.1.255.250 CPUs=32 Weight=20511700 Feature=rack-2,32CPUs

From scontrol show node:

   NodeName=compute-2-0 Arch=x86_64 CoresPerSocket=1
      CPUAlloc=0 CPUTot=32 CPULoad=0.04
      AvailableFeatures=rack-2,32CPUs
      ActiveFeatures=rack-2,32CPUs
      Gres=(null)
      NodeAddr=10.1.255.250 NodeHostName=compute-2-0 Version=18.08
      OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
      RealMemory=257742 AllocMem=0 FreeMem=255703 Sockets=32 Boards=1
      State=IDLE ThreadsPerCore=1 TmpDisk=913567 Weight=20511700 Owner=N/A MCS_label=N/A
      Partitions=CLUSTER,WHEEL,n2013f
      BootTime=2018-10-10T11:06:42 SlurmdStartTime=2018-10-10T11:07:16
      CfgTRES=cpu=32,mem=257742M,billing=94
      AllocTRES=
      CapWatts=n/a
      CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
      ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Second type of node: 2 x 4 core CPUs, hyperthreading on. /proc/cpuinfo confirms that there are 16 “cpus”, with the expected values for physical id and core id. If I set the numbers of sockets/cores/threads as I think is correct (note that this is a different type of machine from the previous one):

   NodeName=compute-0-0 NodeAddr=10.1.255.253 CPUs=16 Weight=20495900 Feature=rack-0,16CPUs Sockets=2 CoresPerSocket=4 ThreadsPerCore=2

I get the following:

   NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=4
      CPUAlloc=0 CPUTot=16 CPULoad=0.42
      AvailableFeatures=rack-0,16CPUs
      ActiveFeatures=rack-0,16CPUs
      Gres=(null)
      NodeAddr=10.1.255.253 NodeHostName=compute-0-0 Version=18.08
      OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
      RealMemory=11842 AllocMem=0 FreeMem=11335 Sockets=2 Boards=1
      State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=275125 Weight=20495900 Owner=N/A MCS_label=N/A
      Partitions=CLUSTER,WHEEL,ib_qdr
      BootTime=2018-10-10T11:06:55 SlurmdStartTime=2018-10-10T11:07:34
      CfgTRES=cpu=16,mem=11842M,billing=18
      AllocTRES=
      CapWatts=n/a
      CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
      ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
      Reason=Low socket*core count [root@2018-10-10T10:26:14]

A couple of things seem suspicious to me, but I’m not sure:

1. I had the impression that slurm is supposed to be able to figure out the node’s architecture automatically, but in the first example there’s no evidence of that in the scontrol output (it reports Sockets=32, CoresPerSocket=1, ThreadsPerCore=1).

2. When I do set the architecture-related parameters explicitly, slurm claims the numbers don’t match, even though sockets * cores * threads = 2 * 4 * 2 = 16 = CPUs.

Does anyone have any idea what’s going on, or what other information would be useful for debugging?

thanks,
Noam
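P.S. If it would help, I can also post what slurmd itself detects on each node type. My understanding (please correct me if I’m wrong) is that running

   slurmd -C

on a compute node prints the detected topology in slurm.conf format, i.e. something along the lines of

   NodeName=compute-0-0 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=11842

(that line is just what I’d expect to see for the second node type, not actual output). And once the config issue is sorted out, I assume the way to clear the drain state is something like

   scontrol update NodeName=compute-0-0 State=RESUME

but I haven’t gotten that far yet.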