AllowedDevicesFile should not be necessary. The relevant devices are identified in gres.conf. "ConstrainDevices=yes" should be all that's needed.
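For illustration, the devices would typically be identified along these lines in gres.conf (a sketch only -- the node names and device paths here are hypothetical and have to match the real hardware; the p100/titanrtx types and counts are taken from the examples further down the thread):

    # gres.conf on the GPU nodes (hypothetical node names)
    NodeName=gpunode01 Name=gpu Type=p100     File=/dev/nvidia[0-3]
    NodeName=gpunode02 Name=gpu Type=titanrtx File=/dev/nvidia[0-7]

With ConstrainDevices=yes in cgroup.conf, the task/cgroup plugin then only grants a job access to the /dev/nvidia* files belonging to the GRES it was actually allocated.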
nvidia-smi will only see the allocated GPUs. Note that a single allocated GPU will always be shown by nvidia-smi as GPU 0, regardless of its actual hardware ordinal, and GPU_DEVICE_ORDINAL will be set to 0. The value of SLURM_STEP_GPUS will be set to the actual device number (N, where the device is /dev/nvidiaN). A quick way to check this from inside a job is sketched below the quoted thread.

On Thu, Jan 14, 2021 at 6:20 PM Ryan Novosielski <novos...@rutgers.edu> wrote:

> AFAIK, if you have this set up correctly, nvidia-smi will be restricted
> too, though I think we were seeing a bug there at one time in this version.
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,      |---------------------------*O*---------------------------
> ||_// the State   |         Ryan Novosielski - novos...@rutgers.edu
> || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
> On Jan 14, 2021, at 18:05, Abhiram Chintangal <achintan...@berkeley.edu> wrote:
>
> Sean,
>
> Thanks for the clarification. I noticed that I am missing the
> "AllowedDevices" option in mine. After adding this, the GPU allocations
> started working. (Slurm version 18.08.8)
>
> I was also incorrectly using "nvidia-smi" as a check.
>
> Regards,
>
> Abhiram
>
> On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
>
>> Hi Abhiram,
>>
>> You need to configure cgroup.conf to constrain the devices a job has
>> access to. See https://slurm.schedmd.com/cgroup.conf.html
>>
>> My cgroup.conf is
>>
>> CgroupAutomount=yes
>> AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
>>
>> ConstrainCores=yes
>> ConstrainRAMSpace=yes
>> ConstrainSwapSpace=yes
>> ConstrainDevices=yes
>>
>> TaskAffinity=no
>>
>> CgroupMountpoint=/sys/fs/cgroup
>>
>> The ConstrainDevices=yes is the key to stopping jobs from having access
>> to GPUs they didn't request.
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>> On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <achintan...@berkeley.edu> wrote:
>>
>>> Hello,
>>>
>>> I recently set up a small cluster at work using Warewulf/Slurm.
>>> Currently, I am not able to get the scheduler to work well with GPUs (GRES).
>>>
>>> While Slurm is able to filter by GPU type, it allocates all the GPUs on
>>> the node. See below:
>>>
>>>> [abhiram@whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
>>>> index, name
>>>> 0, Tesla P100-PCIE-16GB
>>>> 1, Tesla P100-PCIE-16GB
>>>> 2, Tesla P100-PCIE-16GB
>>>> 3, Tesla P100-PCIE-16GB
>>>> [abhiram@whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
>>>> index, name
>>>> 0, TITAN RTX
>>>> 1, TITAN RTX
>>>> 2, TITAN RTX
>>>> 3, TITAN RTX
>>>> 4, TITAN RTX
>>>> 5, TITAN RTX
>>>> 6, TITAN RTX
>>>> 7, TITAN RTX
>>>
>>> I am fairly new to Slurm and still figuring out my way around it. I would
>>> really appreciate any help with this.
>>>
>>> For your reference, I attached the slurm.conf and gres.conf files.
>>> Best,
>>>
>>> Abhiram
>>>
>>> --
>>> Abhiram Chintangal
>>> QB3 Nogales Lab
>>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>>> University of California Berkeley
>>> 708D Stanley Hall, Berkeley, CA 94720
>>> Phone (510) 666-3344
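As referenced in the reply at the top of this thread, a quick check of what a constrained job actually sees might look like this (a sketch only -- the partition and GRES names are taken from Abhiram's example, and the exact environment variables can vary with the Slurm version and gres plugin in use):

    $ srun --gres=gpu:p100:1 -n 1 --partition=gpu \
        bash -c 'echo SLURM_STEP_GPUS=$SLURM_STEP_GPUS GPU_DEVICE_ORDINAL=$GPU_DEVICE_ORDINAL; nvidia-smi -L'

With ConstrainDevices=yes this should list exactly one GPU (reported by nvidia-smi as GPU 0), while SLURM_STEP_GPUS carries the real device number N of /dev/nvidiaN.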