Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-15 Thread Ryan Novosielski
Do you have any more information about that? I think that's the bug I alluded to earlier in the conversation, and I believe I'm affected by it, but I don't know how to tell, how to fix it, or how to refer to it if I wanted to ask SchedMD (we have a contract).

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Fulcomer, Samuel
Also note that there was a bug in an older version of Slurm (pre-17-something) that corrupted the database in a way that prevented GPU/gres fencing. If that affected you and you're still using the same database, GPU fencing probably isn't working. There's a way of fixing this manually through SQL h

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Fulcomer, Samuel
AllowedDevicesFile should not be necessary. The relevant devices are identified in gres.conf. "ConstrainDevices=yes" should be all that's needed. nvidia-smi will only see the allocated GPUs. Note that a single allocated GPU will always be shown by nvidia-smi to be GPU 0, regardless of its actual h
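Since a constrained job sees its GPU renumbered from 0, the index alone does not say which physical card was allocated. One way to check from inside a job is to look at the UUIDs that `nvidia-smi -L` prints. The following is a minimal sketch, assuming the usual `GPU <index>: <name> (UUID: <uuid>)` line format; the helper names `parse_gpu_list` and `visible_gpus` are hypothetical and not part of Slurm or the NVIDIA tools.

```python
import re
import subprocess

# Assumed line format of `nvidia-smi -L`, e.g.:
#   GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-xxxxxxxx-...)
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)\s*$")


def parse_gpu_list(text):
    """Turn `nvidia-smi -L` output into a list of index/name/uuid dicts."""
    gpus = []
    for line in text.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            gpus.append(
                {
                    "index": int(match.group(1)),
                    "name": match.group(2),
                    "uuid": match.group(3),
                }
            )
    return gpus


def visible_gpus():
    """Run `nvidia-smi -L` and return the GPUs this process can see."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_gpu_list(result.stdout)
```

Calling `visible_gpus()` inside a job step and comparing the UUIDs with those seen on the bare node would show which card the job actually received, even though its index reads 0.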

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Abhiram Chintangal
Ryan, That's good to know! It would be great to get this working as users are used to checking via nvidia-smi. For now, I have a few jobs ready for the coming weekend! Will check on this later. Thanks for your help! Abhiram On Thu, Jan 14, 2021 at 3:20 PM Ryan Novosielski wrote: > AFAIK, if

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Ryan Novosielski
AFAIK, if you have this set up correctly, nvidia-smi will be restricted too, though I think we were seeing a bug there at one time in this version.

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Abhiram Chintangal
Sean, Thanks for the clarification. I noticed that I am missing the "AllowedDevices" option in mine. After adding this, the GPU allocations started working. (Slurm version 18.08.8) I was also incorrectly using "nvidia-smi" as a check. Regards, Abhiram On Thu, Jan 14, 2021 at 12:22 AM Sean Crosb

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Ole Holm Nielsen
Hi Sean,
On 1/14/21 9:19 AM, Sean Crosby wrote:
> Hi Abhiram,
> You need to configure cgroup.conf to constrain the devices a job has access to. See https://slurm.schedmd.com/cgroup.conf.html
> My cgroup.conf is
> CgroupAutomount=yes
> AllowedDevicesFile="

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Sean Crosby
Hi Abhiram,
You need to configure cgroup.conf to constrain the devices a job has access to. See https://slurm.schedmd.com/cgroup.conf.html
My cgroup.conf is:
CgroupAutomount=yes
AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
ConstrainCores=yes
ConstrainRAMSpace=yes
Co
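For anyone wanting to confirm that a cgroup.conf actually asks for device fencing, here is a minimal sketch of a checker. It assumes plain `Key=Value` lines with `#` comments and treats keys as case-insensitive, which is a simplifying assumption; `parse_cgroup_conf` and `devices_constrained` are hypothetical helper names, and the sample settings are illustrative rather than a copy of the configuration quoted above.

```python
def parse_cgroup_conf(text):
    """Parse Key=Value lines from cgroup.conf-style text into a dict.

    Keys are lowercased so lookups do not depend on capitalisation.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def devices_constrained(settings):
    """Return True when ConstrainDevices is set to yes."""
    return settings.get("constraindevices", "no").lower() == "yes"


if __name__ == "__main__":
    # Illustrative settings only, not a recommended configuration.
    sample = """
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes
    """
    conf = parse_cgroup_conf(sample)
    print("devices constrained:", devices_constrained(conf))
```

In practice the text would come from reading the site's own cgroup.conf; the path differs between installations, so it is left to the caller.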