It appears that 'slurmd -C' is not returning the correct information for some of the systems in my very heterogeneous cluster.

For example, take the node dawson081:

[root@dawson081 ~]# slurmd -C
slurmd: Considering each NUMA node as a socket
NodeName=dawson081 CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554
UpTime=2-09:30:47

Since Boards and CPUs are mutually exclusive, I omitted CPUs and added this line to my slurm.conf:

NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117] Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554

When I restart Slurm, however, I get the following message in slurmctld.log:

[2019-01-17T14:54:47.788] error: Node dawson081 has high socket,core,thread count (4,8,1 > 2,16,1), extra resources ignored
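
In case it helps with the debugging, I can also dump what the controller has recorded for the node with something like the command below (the grep is just to trim the output to the relevant fields):

scontrol show node dawson081 | grep -E 'Boards|Sockets|Cores|Threads'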

lscpu on that same node shows a different hardware layout:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Model name:            AMD Opteron(TM) Processor 6274
Stepping:              2
CPU MHz:               2200.000
BogoMIPS:              4399.39
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
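
For comparison, a node definition written to match the lscpu layout (2 physical sockets, 8 cores per socket, 2 threads per core) would look roughly like the line below. This is just a sketch for discussion, not something I have tried in production:

NodeName=dawson081 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64554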

Both slurmd and slurmctld are version 18.08.4. I built the Slurm RPMs for both at the same time on the same system, so they were linked to the same hwloc. Any ideas why there's a discrepancy? How should I deal with this?

Both the compute node and the Slurm controller are using CentOS 6.10 and have hwloc-1.5-3 installed.

Thanks for the help

--
Prentice

