Re: [slurm-users] CPUSpecList confusion

2022-12-15 Thread Paul Raines
Turns out on that new node I was running hwloc in a cgroup restricted to cores 0-13, so that was causing the issue. In an unrestricted cgroup shell, "hwloc-ls --only pu" works properly and gives me the correct SLURM mapping. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Thu, 15 Dec 2022
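[A quick way to see the cpuset a shell is actually confined to, which is what tripped up hwloc here; a sketch, where the 0-13 output mirrors the restricted case described above rather than captured output:

$ grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:      0-13
]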

Re: [slurm-users] CPUSpecList confusion

2022-12-15 Thread Wagner, Marcus
Hmm… That one is strange. Can you try just hwloc-ls? I wonder how slurmd would get that information if it is not hwloc-based. Best, Marcus. Sent from my phone. > On 15.12.2022 at 16:00 Paul Raines wrote: > > Nice find! > > Unfortunately this does not work on the original box this

Re: [slurm-users] CPUSpecList confusion

2022-12-15 Thread Paul Raines
Nice find! Unfortunately this does not work on the original box this whole issue started on, where I found the "alternating scheme":

# slurmd -C
NodeName=foobar CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=256312 UpTime=5-14:55:31

# hwloc-ls --only pu
PU L#0
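[For reference, hwloc-ls --only pu prints one line per processing unit, pairing hwloc's logical index (L#) with the OS-assigned physical index (P#); the values below are illustrative of an interleaved numbering, not Paul's actual output:

$ hwloc-ls --only pu
PU L#0 (P#0)
PU L#1 (P#32)
PU L#2 (P#1)
PU L#3 (P#33)
...

Comparing the P# column against Slurm's abstract IDs is what reveals, or rules out, a mapping like the one discussed in this thread.]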

Re: [slurm-users] CPUSpecList confusion

2022-12-14 Thread Marcus Wagner
Hi Paul, as Slurm uses hwloc, I was looking into these tools a little bit deeper. Using your script, I saw e.g. the following output on one node:

=== 31495434 CPU_IDs=21-23,25 21-23,25
=== 31495433 CPU_IDs=16-18,20 10-11,15,17
=== 31487399 CPU_IDs=15 9

That does not match your schemes and on fi

Re: [slurm-users] CPUSpecList confusion

2022-12-14 Thread Paul Raines
Ugh. Guess I cannot count. The mapping on that last node DOES work with the "alternating" scheme, where we have (Slurm ID -> machine ID):

 0 -> 0     6 -> 12    12 -> 1    18 -> 13
 1 -> 2     7 -> 14    13 -> 3    19 -> 15
 2 -> 4     8 -> 16    14 -> 5    20 -> 17
 3 -> 6     9 -> 18    15 -> 7    21 -> 19
 4 -> 8    10 -> 20    16 -> 9    22 -> 21
 5 -> 10   11 -> 22    17 -> 11   23 -> 23

so CPU_IDs=8-11,20-23 does correspond
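[The scheme is regular enough to compute: abstract IDs 0-11 map to the even machine IDs and 12-23 to the odd ones. A hypothetical shell sketch of that rule, inferred from the pairs above rather than from Slurm source:

for i in $(seq 0 23); do
  if [ "$i" -lt 12 ]; then
    echo "$i -> $((2 * i))"             # first half: even machine IDs
  else
    echo "$i -> $((2 * (i - 12) + 1))"  # second half: odd machine IDs
  fi
done
]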

Re: [slurm-users] CPUSpecList confusion

2022-12-14 Thread Paul Raines
Yes, I see that on some of my other machines too. So apicid is definitely not what SLURM is using; it somehow just lines up that way on this one machine I have. I think the issue is that cgroups counts, starting at 0, all the cores on the first socket, then all the cores on the second socket. But
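[The kernel exposes which socket and core sit behind each OS CPU ID, which is one way to test a numbering theory like this; a sketch, with arbitrarily chosen CPU IDs:

for c in 0 1 14 15; do
  pkg=$(cat /sys/devices/system/cpu/cpu$c/topology/physical_package_id)
  core=$(cat /sys/devices/system/cpu/cpu$c/topology/core_id)
  echo "cpu$c: socket=$pkg core=$core"
done
]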

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Marcus Wagner
Hi Paul, sorry to say, but that has to be some coincidence on your system. I've never seen Slurm report core numbers higher than the total number of cores. I have e.g. an Intel Platinum 8160 here: 24 cores per socket, no hyper-threading activated. Yet here are the last lines of /
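[The highest OS processor number on such a box is quick to confirm; a sketch, where the output assumes two sockets of 24 cores with no HT, which is not stated in the message:

$ grep '^processor' /proc/cpuinfo | tail -1
processor       : 47
]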

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Sean Maxwell
Nice find. Thanks for sharing back. On Tue, Dec 13, 2022 at 10:39 AM Paul Raines wrote: > > Yes, looks like SLURM is using the apicid that is in /proc/cpuinfo > The first 14 cpus (procs 0-13) have apicid > 0,2,4,6,8,10,12,14,16,20,22,24,26,28 in /proc/cpuinfo > > So after setting Cp

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
Yes, it looks like SLURM is using the apicid that is in /proc/cpuinfo. The first 14 cpus (procs 0-13) have apicid 0,2,4,6,8,10,12,14,16,20,22,24,26,28 in /proc/cpuinfo. So after setting CpuSpecList=0,2,4,6,8,10,12,14,16,18,20,22,24,26 in slurm.conf it appears to be doing what I want
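[A quick way to dump the processor-to-apicid mapping Paul refers to; a sketch whose awk field positions assume the usual "key : value" layout of /proc/cpuinfo:

$ awk '/^processor/ {p = $3} /^apicid/ {print "proc " p " -> apicid " $3}' /proc/cpuinfo
]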

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Sean Maxwell
In the slurm.conf manual they state the CpuSpecList ids are "abstract", and the CPU management docs reinforce the notion that the abstract Slurm IDs are not related to the Linux hardware IDs, so that is probably the source of the behavior. I unfortunately don't have more information. On Tue,

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
Hmm. Actually it looks like confusion between CPU IDs on the system and what SLURM thinks the IDs are:

# scontrol -d show job 8
...
Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES=
...
# cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective
7-10,39-42

-- Paul Raines (htt
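[Worth noting: 7-10,39-42 has the shape of four physical cores plus their hyperthread siblings, which would account for the eight abstract CPU_IDs. The pairing can be checked in sysfs; a sketch, where the sibling output shown is an assumption about this box rather than confirmed output:

$ cat /sys/devices/system/cpu/cpu7/topology/thread_siblings_list
7,39
]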

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
Oh, but that does explain the CfgTRES=cpu=14. With the CpuSpecList below and SlurmdOffSpec I do get CfgTRES=cpu=50 (the node's 64 CPUs minus the 14 listed in CpuSpecList), so that makes sense. The issue remains that though the number of cpus in CpuSpecList is taken into account, the exact IDs seem to be ignored. -- Paul Raines (http://help.nmr.mgh

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
I have tried it both ways with the same result: the assigned CPUs will be both in and out of the range given to CpuSpecList. I tried setting it using commas instead of ranges, so used CpuSpecList=0,1,2,3,4,5,6,7,8,9,10,11,12,13, but it still does not work. $ srun -p basic -N 1 --ntasks-per-node=1 --m
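[Independent of what scontrol reports, the CPUs a step actually lands on can be read from inside the job itself; a sketch reusing Paul's partition and layout, reproducing only the srun flags visible before the message is cut off:

$ srun -p basic -N 1 --ntasks-per-node=1 \
    bash -c 'grep Cpus_allowed_list /proc/self/status'
]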

Re: [slurm-users] CPUSpecList confusion

2022-12-12 Thread Sean Maxwell
Hi Paul,

> Nodename=foobar \
>   CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
>   RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \
>   TmpDisk=600 Gres=gpu:nvidia_rtx_a6000:1
>
> The slurm.conf also has:
>
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/a
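[For context, reserving CPUs for non-Slurm work usually combines a CpuSpecList with keeping the Slurm daemons off the specialized set; a sketch of such a slurm.conf, not Paul's verified config, where the TaskPlugin value is an assumption since it is cut off above and SlurmdOffSpec comes up elsewhere in the thread:

NodeName=foobar CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 \
    ThreadsPerCore=2 RealMemory=256312 MemSpecLimit=32768 \
    CpuSpecList=14-63
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup   # assumed; original is truncated
TaskPluginParam=SlurmdOffSpec          # daemons avoid the CPUs in CpuSpecList
]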

[slurm-users] CPUSpecList confusion

2022-12-09 Thread Paul Raines
I have a Rocky 8 system that with hyperthreading has 64 cores. I want the first 14 cores reserved for logged-in users and non-SLURM work; I want SLURM to use the rest. I configured the box to boot with systemd.unified_cgroup_hierarchy=1 to use cgroup v2. I ran systemctl set-property user.
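[The command is cut off above; a hypothetical sketch of that kind of cgroup-v2 pinning, where the slice name and property are assumptions rather than the thread's confirmed command:

# confine interactive user sessions to the first 14 CPUs (hypothetical)
$ systemctl set-property user.slice AllowedCPUs=0-13
]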