Never mind. This was a layer 8 problem: I was editing the wrong slurm.conf. We recently switched to installing Slurm from RPMs, and I was accidentally editing the file in the location we used before that switch. It turns out those errors had always been in slurmctld.log and no one ever noticed. Now that I'm putting the output of 'slurmd -C' into the correct file, the errors have gone away.
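For anyone else who hits this, one quick sanity check (just a suggestion, not something from the thread above) is to ask a running daemon which config file it actually loaded, for example:

scontrol show config | grep SLURM_CONF

That prints the path of the slurm.conf the controller is reading, which would have exposed the stale copy right away.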

What is interesting is that the configuration produced by 'slurmd -C' treats each NUMA node as a separate socket (4 sockets), while the old configuration in slurm.conf matched the physical layout (2 sockets), so it was the 'correct' physical configuration that had been causing those errors.
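To make that concrete (this is my reconstruction, since the old line isn't quoted anywhere in this thread), the node definition effectively went from the physical-layout form implied by the 2,16,1 in the old error messages, roughly

NodeName=dawson081 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=64554

to the NUMA-node-as-socket form that 'slurmd -C' reports:

NodeName=dawson081 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554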

Prentice

On 1/17/19 3:09 PM, Prentice Bisbal wrote:
It appears that 'slurmd -C' is not returning the correct information for some of the systems in my very heterogeneous cluster.

For example, take the node dawson081:

[root@dawson081 ~]# slurmd -C
NodeName=dawson081 slurmd: Considering each NUMA node as a socket
CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554
UpTime=2-09:30:47

Since Boards and CPUs are mutually exclusive, I omitted CPUs and added this line to my slurm.conf:

NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117] Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554

When I restart Slurm, however, I get the following messages in slurmctld.log:

[2019-01-17T14:54:47.788] error: Node dawson081 has high socket,core,thread count (4,8,1 > 2,16,1), extra resources ignored
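As a cross-check (a suggestion, not output I captured at the time), you can compare this against what the controller has actually registered for the node:

scontrol show node dawson081

The Boards, Sockets, CoresPerSocket, ThreadsPerCore and CPUTot fields in that output show the values slurmctld is working from.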

lscpu on that same node shows a different hardware layout:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Model name:            AMD Opteron(TM) Processor 6274
Stepping:              2
CPU MHz:               2200.000
BogoMIPS:              4399.39
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31

Both slurmd and slurmctld are version 18.08.4. I built the Slurm RPMs for both at the same time on the same system, so they were linked to the same hwloc. Any ideas why there's a discrepancy? How should I deal with this?

Both the compute node and the Slurm controller are using CentOS 6.10 and have hwloc-1.5-3 installed.
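Since slurmd derives its topology from hwloc, it may also be worth comparing hwloc's own view of the node against lscpu, for example with the lstopo utility that ships with hwloc (output format varies by hwloc version, and it will open a graphical window if run under X):

lstopo

Comparing the sockets and NUMA nodes hwloc sees against lscpu should show whether the NUMA-as-socket view is coming from the hwloc data itself or from Slurm's interpretation of it.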

Thanks for the help

