Hi Sid,

On our cluster, it behaves just like your PBS cluster:

$ srun -N 1 --cpus-per-task 8 --time 01:00:00 --mem 2g --partition physicaltest -q hpcadmin --pty python3
srun: job 27060036 queued and waiting for resources
srun: job 27060036 has been allocated resources
Python 3.6.8 (default, Aug 13 2020, 07:46:32)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.cpu_count()
72
>>> len(os.sched_getaffinity(0))
8

We do have cgroups set up to limit which CPUs a user has access to. We do it with:

ProctrackType           = proctrack/cgroup
TaskPlugin              = task/affinity,task/cgroup
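
Note that the task/cgroup plugin also needs a cgroup.conf alongside slurm.conf before it will confine tasks. A minimal sketch (assuming you want CPU and memory confinement only) would be:

ConstrainCores=yes
ConstrainRAMSpace=yes

With that in place, sched_getaffinity only sees the CPUs the job was allocated.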

Sean

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sid Young <sid.yo...@gmail.com>
Sent: Friday, 18 June 2021 14:20
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [EXT] [slurm-users] incorrect number of cpu's being reported in srun job

G'Day all,

I've had a question from a user of our new HPC; the following should explain it:

➜ srun -N 1 --cpus-per-task 8 --time 01:00:00 --mem 2g --pty python3
Python 3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.cpu_count()
256
>>> len(os.sched_getaffinity(0))
256
>>>

The output of os.cpu_count() is correct: there are 256 CPUs on the server. But len(os.sched_getaffinity(0)) is also 256, whereas I was expecting 8, the number of CPUs this process should be restricted to. Is my Slurm command incorrect? When I run a similar test on XXXXXX I get the expected behaviour:

➜ qsub -I -l select=1:ncpus=4:mem=1gb
qsub: waiting for job 9616042.pbs to start
qsub: job 9616042.pbs ready
➜ python3
Python 3.4.10 (default, Dec 13 2019, 16:20:47) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.cpu_count()
72
>>> len(os.sched_getaffinity(0))
4
>>>
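
For reference, my understanding is that a well-behaved program on Linux should check the affinity mask rather than the raw CPU count, along the lines of this sketch:

import os

def usable_cpus():
    # sched_getaffinity(0) reflects cgroup/affinity limits on the
    # current process, but it is Linux-only; fall back to the raw
    # count (or 1) on platforms that don't provide it.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1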

This is a real problem for me, as a program supplied by a third-party company keeps trying to run with 256 threads and crashes. It's a compiled binary, so I can't tell whether it just grabs the total CPU count or correctly queries the scheduler affinity, but on TRI's HPC both return the total number of CPUs anyway. The program has no option to set the number of threads manually.
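
The only workaround I can think of is wrapping the binary inside the job (vendor_binary below is just a stand-in for the real program, and neither approach is tested):

# If the binary happens to use OpenMP, it may honour this hint:
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK ./vendor_binary

# Otherwise, pin it to an explicit set of cores by hand:
taskset -c 0-7 ./vendor_binary

I'd much rather fix this at the scheduler level, though.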

My question to the group is: what's causing this? Do I need a cgroups plugin?

I think these are the relevant lines from the slurm.conf file:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
ReturnToService=1
CpuFreqGovernors=OnDemand,Performance,UserSpace
CpuFreqDef=Performance

Sid Young
Translational Research Institute
