So we figured out the problem with "slurmd -C": we had run rpmbuild on the
POWER9 node, but did not have the hwloc package installed. The build
process looks for this, and if it is not found, will apparently not use
hwloc/lstopo even if they are installed post-build.
Now Slurm reports the expected topology for
Core L#21
PU L#42 (P#84)
PU L#43 (P#85)
...
L3 L#19 (10MB) + L2 L#19 (512KB)
L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
PU L#76 (P#152)
PU L#77 (P#153)
L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
PU L#78 (P#156)
PU L#79 (P#157)
So my guess here is that GPU0,GPU1 would get Cores=0-19, and GPU2,GPU3 get
Cores=20-39 as numbered by lstopo?
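If that mapping is right, a gres.conf along these lines might express it. This is only a sketch: the /dev/nvidia* file names and the GPU-to-socket split are assumptions on my part, not something verified on the node.

```shell
# Hypothetical gres.conf sketch for a 4-GPU AC922, assuming GPU0/GPU1
# attach to socket 0 (cores 0-19) and GPU2/GPU3 to socket 1 (cores 20-39),
# using lstopo's logical core numbering.
Name=gpu File=/dev/nvidia0 Cores=0-19
Name=gpu File=/dev/nvidia1 Cores=0-19
Name=gpu File=/dev/nvidia2 Cores=20-39
Name=gpu File=/dev/nvidia3 Cores=20-39
```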
- Keith Ball
Hi All,
We have installed slurm 17.11.8 on IBM AC922 nodes (POWER9) that have 4
GPUs each, and are running RHEL 7.5-ALT. Physically, these are 2-socket
nodes, with each socket having 20 cores. Depending on SMT setting (SMT1,
SMT2, SMT4) there can be 40, 80, or 160 "processors/CPUs" virtually.
Som
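The CPU counts above follow directly from the socket/core layout; a quick sketch of the arithmetic:

```shell
# 2 sockets x 20 cores per socket, times the SMT level,
# gives the "processors/CPUs" count Slurm sees.
for smt in 1 2 4; do
  echo "SMT${smt}: $(( 2 * 20 * smt )) CPUs"
done
```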
Hi All,
We are looking to have time-based partitions; e.g. a "day" and a "night"
partition (using the same group of compute nodes).
1.) For a “night” partition, jobs will only be allocated resources once the
“night-time” window is reached (e.g. 6pm – 7am). Ideally, the jobs in the
“night” partition
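As far as I know Slurm has no built-in time-of-day window for partitions, so one common workaround is to toggle the partition state from cron. A sketch, assuming a partition actually named "night" (check how your Slurm version treats already-running jobs when a partition goes DOWN):

```shell
# Hypothetical crontab entries: open the "night" partition at 6pm and
# set it DOWN at 7am so no new jobs are scheduled onto it during the day.
0 18 * * * /usr/bin/scontrol update PartitionName=night State=UP
0 7  * * * /usr/bin/scontrol update PartitionName=night State=DOWN
```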
Hi All,
I am having an issue with jobs that end, whether by an "scancel", by being
killed due to a job wall-time timeout, or even by exiting the shell in an
"srun --pty" interactive session. An excerpt from /var/log/slurmd where a
typical job was running:
[2018-03-05T12:48:49.165] _run_prolog: run job