Hi Abhiram,

You need to configure cgroup.conf to constrain the devices a job has access to. See https://slurm.schedmd.com/cgroup.conf.html
My cgroup.conf is:

CgroupAutomount=yes
AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup

ConstrainDevices=yes is the key to stopping jobs from having access to GPUs they didn't request.

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <achintan...@berkeley.edu> wrote:

> Hello,
>
> I recently set up a small cluster at work using Warewulf/Slurm. Currently,
> I am not able to get the scheduler to work well with GPUs (GRES).
>
> While Slurm is able to filter by GPU type, it allocates all the GPUs on
> the node. See below:
>
>> [abhiram@whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
>> index, name
>> 0, Tesla P100-PCIE-16GB
>> 1, Tesla P100-PCIE-16GB
>> 2, Tesla P100-PCIE-16GB
>> 3, Tesla P100-PCIE-16GB
>> [abhiram@whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
>> index, name
>> 0, TITAN RTX
>> 1, TITAN RTX
>> 2, TITAN RTX
>> 3, TITAN RTX
>> 4, TITAN RTX
>> 5, TITAN RTX
>> 6, TITAN RTX
>> 7, TITAN RTX
>
> I am fairly new to Slurm and still figuring out my way around it. I would
> really appreciate any help with this.
>
> For your reference, I attached the slurm.conf and gres.conf files.
>
> Best,
>
> Abhiram
>
> --
>
> Abhiram Chintangal
> QB3 Nogales Lab
> Bioinformatics Specialist @ Howard Hughes Medical Institute
> University of California Berkeley
> 708D Stanley Hall, Berkeley, CA 94720
> Phone (510) 666-3344
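For reference, a minimal sketch of the GRES wiring that ConstrainDevices=yes acts on. The node names, GPU counts, and device paths below are illustrative and are not taken from the attached slurm.conf/gres.conf; note also that ConstrainDevices only takes effect when TaskPlugin includes task/cgroup.

# slurm.conf (illustrative; hypothetical node names)
GresTypes=gpu
TaskPlugin=task/affinity,task/cgroup
NodeName=gpunode01 Gres=gpu:p100:4        # plus the usual CPUs/RealMemory/etc.
NodeName=gpunode02 Gres=gpu:titanrtx:8    # plus the usual CPUs/RealMemory/etc.

# gres.conf on each GPU node; File= must match the node's real /dev/nvidia* devices
Name=gpu Type=p100     File=/dev/nvidia[0-3]
Name=gpu Type=titanrtx File=/dev/nvidia[0-7]

With that in place, a job requesting two GPUs should only see the two devices it was allocated, e.g.:

srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
# expected: only the two allocated P100s are listed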