Also note that there was a bug in an older version of SLURM (pre-17-something) that corrupted the database in a way that prevented GPU/gres fencing. If that affected you and you're still using the same database, GPU fencing probably isn't working. There's a way of fixing this manually through SQL hacking; however, we just went with a virgin database when we last upgraded to get it working (and sucked the accounting data into XDMoD).
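A quick way to check whether fencing is actually in effect is to request a single GPU and look at what the job step can see. This is only a sketch; the partition name and gres string below are examples, so adjust them for your site:

    # Ask for one GPU and inspect the step environment. With ConstrainDevices=yes
    # working, nvidia-smi should list exactly one GPU (reported as GPU 0), while
    # SLURM_STEP_GPUS holds the real /dev/nvidiaN device number.
    srun --gres=gpu:1 -n 1 --partition=gpu \
        bash -c 'echo "SLURM_STEP_GPUS=$SLURM_STEP_GPUS GPU_DEVICE_ORDINAL=$GPU_DEVICE_ORDINAL"; nvidia-smi -L'

If nvidia-smi lists every GPU in the node rather than just the one allocated, the cgroup device constraint isn't being applied.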
On Thu, Jan 14, 2021 at 6:36 PM Fulcomer, Samuel <samuel_fulco...@brown.edu> wrote:

> AllowedDevicesFile should not be necessary. The relevant devices are
> identified in gres.conf. "ConstrainDevices=yes" should be all that's needed.
>
> nvidia-smi will only see the allocated GPUs. Note that a single allocated
> GPU will always be shown by nvidia-smi to be GPU 0, regardless of its
> actual hardware ordinal, and GPU_DEVICE_ORDINAL will be set to 0. The value
> of SLURM_STEP_GPUS will be set to the actual device number (N, where the
> device is /dev/nvidiaN).
>
> On Thu, Jan 14, 2021 at 6:20 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
>
>> AFAIK, if you have this set up correctly, nvidia-smi will be restricted
>> too, though I think we were seeing a bug there at one time in this version.
>>
>> --
>> #BlackLivesMatter
>> ____
>> || \\UTGERS,      |---------------------------*O*---------------------------
>> ||_// the State   | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
>>      `'
>>
>> On Jan 14, 2021, at 18:05, Abhiram Chintangal <achintan...@berkeley.edu> wrote:
>>
>> Sean,
>>
>> Thanks for the clarification. I noticed that I am missing the
>> "AllowedDevices" option in mine. After adding this, the GPU allocations
>> started working. (Slurm version 18.08.8)
>>
>> I was also incorrectly using "nvidia-smi" as a check.
>>
>> Regards,
>>
>> Abhiram
>>
>> On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
>>
>>> Hi Abhiram,
>>>
>>> You need to configure cgroup.conf to constrain the devices a job has
>>> access to. See https://slurm.schedmd.com/cgroup.conf.html
>>>
>>> My cgroup.conf is
>>>
>>> CgroupAutomount=yes
>>> AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
>>> ConstrainCores=yes
>>> ConstrainRAMSpace=yes
>>> ConstrainSwapSpace=yes
>>> ConstrainDevices=yes
>>> TaskAffinity=no
>>> CgroupMountpoint=/sys/fs/cgroup
>>>
>>> The ConstrainDevices=yes is the key to stopping jobs from having access
>>> to GPUs they didn't request.
>>>
>>> Sean
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>> On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <achintan...@berkeley.edu> wrote:
>>>
>>>> Hello,
>>>>
>>>> I recently set up a small cluster at work using Warewulf/Slurm.
>>>> Currently, I am not able to get the scheduler to work well with GPUs (gres).
>>>>
>>>> While Slurm is able to filter by GPU type, it allocates all the GPUs
>>>> on the node.
>>>> See below:
>>>>
>>>>> [abhiram@whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
>>>>> index, name
>>>>> 0, Tesla P100-PCIE-16GB
>>>>> 1, Tesla P100-PCIE-16GB
>>>>> 2, Tesla P100-PCIE-16GB
>>>>> 3, Tesla P100-PCIE-16GB
>>>>> [abhiram@whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu nvidia-smi --query-gpu=index,name --format=csv
>>>>> index, name
>>>>> 0, TITAN RTX
>>>>> 1, TITAN RTX
>>>>> 2, TITAN RTX
>>>>> 3, TITAN RTX
>>>>> 4, TITAN RTX
>>>>> 5, TITAN RTX
>>>>> 6, TITAN RTX
>>>>> 7, TITAN RTX
>>>>
>>>> I am fairly new to Slurm and still figuring out my way around it. I
>>>> would really appreciate any help with this.
>>>>
>>>> For your reference, I attached the slurm.conf and gres.conf files.
>>>>
>>>> Best,
>>>>
>>>> Abhiram
>>>>
>>>> --
>>>> Abhiram Chintangal
>>>> QB3 Nogales Lab
>>>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>>>> University of California Berkeley
>>>> 708D Stanley Hall, Berkeley, CA 94720
>>>> Phone (510)666-3344
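Since gres.conf comes up repeatedly above but was only attached, not quoted, here is a minimal sketch of the two pieces that tie the GPUs to the cgroup device constraint. The node name, GPU count, and device paths are assumptions for illustration, not taken from the attached files:

    # gres.conf on the GPU node (sketch; device paths and count are assumed).
    # Each File= entry names the /dev/nvidiaN device that the cgroup will allow or deny.
    Name=gpu Type=p100 File=/dev/nvidia[0-3]

    # slurm.conf, relevant fragments only (node name and count are assumed).
    GresTypes=gpu
    NodeName=gpunode01 Gres=gpu:p100:4

With ConstrainDevices=yes set in cgroup.conf as shown in the thread, a job submitted with --gres=gpu:p100:2 should then see only two of those four devices from inside the job step.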