Never mind. This was a layer 8 problem. I was editing the wrong
slurm.conf. We recently switched to using RPMs, and I accidentally
edited the file in the location we used before the switch.
It turns out those errors were always there in slurmctld.log, and no one
ever noticed. Now that I am using the output of 'slurmd -C' in the
correct file, those errors have gone away.
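For anyone who trips over the same thing: a quick sanity check is to ask
the running daemons which config file they actually loaded and to compare
the candidate locations. The paths below are only examples; the pre-RPM
location depends on the prefix used by the original from-source build.

    # Ask the controller which slurm.conf it read
    scontrol show config | grep SLURM_CONF

    # RPM installs put the config under /etc/slurm by default;
    # a from-source build defaults to <prefix>/etc
    ls -l /etc/slurm/slurm.conf /usr/local/etc/slurm.conf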
What is interesting is that the configuration produced by 'slurmd -C'
treats each NUMA node as a separate socket (4 sockets), while the old
configuration in slurm.conf matched the physical layout (2 sockets), so
the 'correct' physical configuration was what had been causing those
errors all along.
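In other words, the node definition that keeps slurmctld quiet is the one
that mirrors 'slurmd -C' (each NUMA node treated as a socket), not the one
that mirrors lscpu. Roughly, for a single node (a sketch trimmed from the
full NodeName list in the quoted message below):

    NodeName=dawson081 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554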
Prentice
On 1/17/19 3:09 PM, Prentice Bisbal wrote:
It appears that 'slurmd -C' is not returning the correct information
for some of the systems in my very heterogeneous cluster.
For example, take the node dawson081:
[root@dawson081 ~]# slurmd -C
slurmd: Considering each NUMA node as a socket
NodeName=dawson081 CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8
ThreadsPerCore=1 RealMemory=64554
UpTime=2-09:30:47
Since Boards and CPUs are mutually exclusive, I omitted CPUs and added
this line to my slurm.conf:
NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117]
Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1
RealMemory=64554
When I restart Slurm, however, I get messages like the following in
slurmctld.log:
[2019-01-17T14:54:47.788] error: Node dawson081 has high
socket,core,thread count (4,8,1 > 2,16,1), extra resources ignored
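For what it's worth, both tuples in that error multiply out to the same
32 CPUs, so only the socket/core split is in dispute, not the total CPU
count:

    4 sockets x  8 cores x 1 thread = 32 CPUs
    2 sockets x 16 cores x 1 thread = 32 CPUs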
lscpu on that same node shows a different hardware layout:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 21
Model: 1
Model name: AMD Opteron(TM) Processor 6274
Stepping: 2
CPU MHz: 2200.000
BogoMIPS: 4399.39
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
Both slurmd and slurmctld are version 18.08.4. I built the Slurm RPMs
for both at the same time on the same system, so they were linked to
the same hwloc. Any ideas why there's a discrepancy? How should I deal
with this?
Both the compute node and the Slurm controller are using CentOS 6.10
and have hwloc-1.5-3 installed.
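In case it is useful, my next step is to compare hwloc's own view of the
topology against lscpu on one of the affected nodes, to see whether the
NUMA-node-as-socket interpretation comes from hwloc or from slurmd itself.
Something along these lines (hwloc-ls may live in a separate subpackage on
CentOS 6, and numactl may need to be installed):

    # hwloc's view: packages/sockets vs NUMA nodes
    hwloc-ls

    # the kernel's NUMA layout
    numactl --hardware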
Thanks for the help.