Re: [slurm-users] Slurm on POWER9

2018-09-14 Thread Keith Ball
So we figured out the problem with "slurmd -C": we had run rpmbuild on the POWER9 node, but did not have the hwloc package installed. The build process looks for this and, if it is not found, will apparently not use hwloc/lstopo even if it is installed post-build. Now Slurm reports the expected topology for
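For anyone hitting the same thing, a minimal sketch of the fix, assuming a RHEL-style yum install; the tarball name below is just illustrative:

    # Install hwloc (and its headers) before building the RPMs,
    # so Slurm's configure step can detect it
    yum install -y hwloc hwloc-devel

    # Rebuild the Slurm RPMs on the POWER9 node
    rpmbuild -ta slurm-17.11.8.tar.bz2

    # After reinstalling, check the topology slurmd now detects
    slurmd -C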

Re: [slurm-users] Slurm on POWER9

2018-09-12 Thread Keith Ball
...Core L#21
        PU L#42 (P#84)
        PU L#43 (P#85)
    ...
    L3 L#19 (10MB) + L2 L#19 (512KB)
      L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
        PU L#76 (P#152)
        PU L#77 (P#153)
      L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
        PU L#78 (P#156)
        PU L#79 (P#157)
So my guess here is that GPU0,GPU1 would get Cores=0-19, and GPU2,GPU3 would get Cores=20-39 as numbered by lstopo? - Keith Ball
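If that numbering holds, a gres.conf along these lines would express the binding. This is only a sketch: the node name and device files are placeholders, and I believe 17.11 accepts Cores= (older releases used CPUs=):

    # gres.conf (sketch; NodeName and File paths are placeholders)
    NodeName=ac922-01 Name=gpu File=/dev/nvidia0 Cores=0-19
    NodeName=ac922-01 Name=gpu File=/dev/nvidia1 Cores=0-19
    NodeName=ac922-01 Name=gpu File=/dev/nvidia2 Cores=20-39
    NodeName=ac922-01 Name=gpu File=/dev/nvidia3 Cores=20-39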

[slurm-users] Slurm on POWER9

2018-09-10 Thread Keith Ball
Hi All, We have installed Slurm 17.11.8 on IBM AC922 nodes (POWER9) that have 4 GPUs each and are running RHEL 7.5-ALT. Physically, these are 2-socket nodes, with each socket having 20 cores. Depending on the SMT setting (SMT1, SMT2, SMT4), there can be 40, 80, or 160 virtual "processors/CPUs". Som
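For reference, a node definition along these lines would match that layout at SMT4. The node names and memory value below are placeholders, not figures from the original post:

    # slurm.conf (sketch)
    NodeName=ac922-[01-02] Sockets=2 CoresPerSocket=20 ThreadsPerCore=4 RealMemory=580000 Gres=gpu:4 State=UNKNOWN

Running "slurmd -C" on one of the nodes should report the same Sockets/CoresPerSocket/ThreadsPerCore figures once topology detection is working.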

[slurm-users] Time-based partitions

2018-03-12 Thread Keith Ball
Hi All, We are looking to have time-based partitions; e.g. a "day" and a "night" partition (using the same group of compute nodes). 1.) For a "night" partition, jobs will only be allocated resources once the "night-time" window is reached (e.g. 6pm – 7am). Ideally, the jobs in the "night" partition
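Slurm has no built-in time-of-day partition that I know of; one common workaround (my assumption, not something established in this thread) is to define both partitions in slurm.conf and flip the night partition's state from cron:

    # slurm.conf (sketch; node list and limits are placeholders)
    PartitionName=day   Nodes=node[01-16] MaxTime=12:00:00 State=UP
    PartitionName=night Nodes=node[01-16] MaxTime=12:00:00 State=DOWN

    # root crontab: open "night" at 6pm, close it at 7am
    0 18 * * * /usr/bin/scontrol update PartitionName=night State=UP
    0 7  * * * /usr/bin/scontrol update PartitionName=night State=DOWN

Jobs submitted to the DOWN partition should simply stay pending until it is brought up; keeping jobs from running past 7am would additionally require a suitably short MaxTime or similar limit.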

[slurm-users] "stepd terminated due to job not ending with signals"

2018-03-06 Thread Keith Ball
Hi All, I am having an issue with jobs that end, either by an "scancel", by being killed due to a job wall-time timeout, or even (with an srun --pty interactive shell) by exiting the shell. An excerpt from /var/log/slurmd where a typical job was running: [2018-03-05T12:48:49.165] _run_prolog: run job
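As I understand it, that "stepd terminated due to job not ending with signals" message is logged when a step's processes do not go away after being signaled. If the processes are simply slow to exit (e.g. flushing I/O), one knob to try (my assumption, not something confirmed in this thread) is UnkillableStepTimeout, optionally paired with UnkillableStepProgram to capture diagnostics:

    # slurm.conf (sketch; the value and script path are illustrative)
    UnkillableStepTimeout=180
    UnkillableStepProgram=/usr/local/sbin/report_unkillable_step.sh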