Thanks, Christoph and others, for the help. It turns out the fix was simply the cgroup setup I had mostly in place months ago: I had even left myself a note to uncomment ConstrainDevices=yes in cgroup.conf once the GPU systems came online.
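In case it helps anyone else, the relevant piece is roughly the following (a sketch rather than our full config; it assumes TaskPlugin=task/cgroup and ProctrackType=proctrack/cgroup are already set in slurm.conf):

    # cgroup.conf (sketch)
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    # This is the line I had left commented out. Without it, jobs that
    # never request a GPU can still open every /dev/nvidia* on the node.
    ConstrainDevices=yes

With ConstrainDevices=yes the task/cgroup plugin only whitelists the device files for the GRES a job was actually allocated, so a job submitted without --gres=gpu simply sees no GPUs.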
Kept racking my brain about why the gres settings weren't confining anything even though the number of requested GPUs was set correctly. Everything is working as expected now.

Willy Markuske
HPC Systems Engineer
Research Data Services
P: (858) 246-5593

On 8/25/20 8:24 AM, Christoph Brüning wrote:
> Hello,
>
> we're using cgroups to restrict access to the GPUs.
>
> What I found particularly helpful are the slides by Marshall Garey
> from last year's Slurm User Group Meeting:
> https://urldefense.com/v3/__https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf__;!!Mih3wA!XNe605WUGPer00S7oSxp5Vkj06UAdkDNiE-hhGSr9HvCBjneYA_8p1C12xnCD17p$
> (NVML didn't work for us for some reason I cannot recall, but listing
> the GPU device files explicitly was not a big deal)
>
> Best,
> Christoph
>
>
> On 25/08/2020 16.12, Willy Markuske wrote:
>> Hello,
>>
>> I'm trying to restrict access to GPU resources on a cluster I
>> maintain for a research group. There are two nodes in a partition
>> with gres gpu resources defined. Users can access these resources by
>> submitting their jobs to the gpu partition and requesting a gres=gpu.
>>
>> When a user includes the flag --gres=gpu:# they are allocated that
>> number of GPUs and Slurm allocates them properly. If a user requests
>> only 1 GPU they only see CUDA_VISIBLE_DEVICES=1. However, if a user
>> does not include the --gres=gpu:# flag they can still submit a job
>> to the partition and are then able to see all the GPUs. This has led
>> to some bad actors running jobs on all the GPUs that other users have
>> been allocated, causing OOM errors on the GPUs.
>>
>> Is it possible, and where would I find the documentation on doing so,
>> to require users to specify --gres=gpu:# in order to submit to a
>> partition? So far, reading the gres documentation doesn't seem to
>> have yielded anything on this issue specifically.
>>
>> Regards,
>>
>> --
>>
>> Willy Markuske
>>
>> HPC Systems Engineer
>>
>> Research Data Services
>>
>> P: (858) 246-5593
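For completeness, the explicit device-file approach Christoph mentions (as opposed to NVML autodetection) would look something like this in gres.conf; the node names, GPU count, and device paths below are placeholders, not our actual layout:

    # gres.conf (sketch): list the GPU device files explicitly
    NodeName=gpu[01-02] Name=gpu File=/dev/nvidia[0-3]

    # corresponding entry in slurm.conf (other node parameters omitted)
    NodeName=gpu[01-02] Gres=gpu:4

Users then request GPUs the same way as before, e.g. sbatch -p gpu --gres=gpu:2 job.sh, and with device confinement enabled they only see the GPUs they were allocated.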