Never mind. This was a layer 8 problem. I was editing the wrong
slurm.conf. We recently switched to using RPMs, and I accidentally
edited the file in the location we used before the switch.
It turns out those errors were always there in slurmctld.log, and no one
ever noticed. Now that I am using the output of 'slurmd -C' in the
correct file, those errors have gone away.
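For anyone who trips over the same thing: a quick sanity check is to ask
the running daemons which config file they actually loaded and to compare
the candidate locations. The paths below are only examples; the pre-RPM
location depends on the prefix used by the original from-source build.

    # Ask the controller which slurm.conf it read
    scontrol show config | grep SLURM_CONF

    # RPM installs put the config under /etc/slurm by default;
    # a from-source build defaults to <prefix>/etc
    ls -l /etc/slurm/slurm.conf /usr/local/etc/slurm.conf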
What is interesting is that the configuration produced by 'slurmd -C'
treats each NUMA node as a separate socket (4 sockets), while the old
configuration in slurm.conf matched the physical layout (2 sockets), so
the 'correct' physical configuration was what had been causing those
errors all along.
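In other words, the node definition that keeps slurmctld quiet is the one
that mirrors 'slurmd -C' (each NUMA node treated as a socket), not the one
that mirrors lscpu. Roughly, for a single node (a sketch trimmed from the
full NodeName list in the quoted message below):

    NodeName=dawson081 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554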
Prentice
On 1/17/19 3:09 PM, Prentice Bisbal wrote:
It appears that 'slurmd -C' is not returning the correct information
for some of the systems in my very heterogeneous cluster.
For example, take the node dawson081:
[root@dawson081 ~]# slurmd -C
slurmd: Considering each NUMA node as a socket
NodeName=dawson081 CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8
ThreadsPerCore=1 RealMemory=64554
UpTime=2-09:30:47
Since Boards and CPUs are mutually exclusive, I omitted CPUs and added
this line to my slurm.conf:
NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117]
Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1
RealMemory=64554
When I restart Slurm, however, I get messages like the following in
slurmctld.log:
[2019-01-17T14:54:47.788] error: Node dawson081 has high
socket,core,thread count (4,8,1 > 2,16,1), extra resources ignored
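For what it's worth, both tuples in that error multiply out to the same
32 CPUs, so only the socket/core split is in dispute, not the total CPU
count:

    4 sockets x  8 cores x 1 thread = 32 CPUs
    2 sockets x 16 cores x 1 thread = 32 CPUs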
lscpu on that same node shows a different hardware layout:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 21
Model: 1
Model name: AMD Opteron(TM) Processor 6274
Stepping: 2
CPU MHz: 2200.000
BogoMIPS: 4399.39
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
Both slurmd and slurmctld are version 18.08.4. I built the Slurm RPMs
for both at the same time on the same system, so they were linked to
the same hwloc. Any ideas why there's a discrepancy? How should I deal
with this?
Both the compute node and the Slurm controller are using CentOS 6.10
and have hwloc-1.5-3 installed.
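In case it is useful, my next step is to compare hwloc's own view of the
topology against lscpu on one of the affected nodes, to see whether the
NUMA-node-as-socket interpretation comes from hwloc or from slurmd itself.
Something along these lines (hwloc-ls may live in a separate subpackage on
CentOS 6, and numactl may need to be installed):

    # hwloc's view: packages/sockets vs NUMA nodes
    hwloc-ls

    # the kernel's NUMA layout
    numactl --hardware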
Thanks for the help.