Cgroups should work correctly _if_ you're not running with an old, corrupted Slurm database.
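For reference, the GPU fencing itself comes from Slurm's cgroup plugins. A minimal sketch of the settings involved, assuming the stock task/cgroup setup and NVIDIA devices (the device paths are just examples):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    GresTypes=gpu

    # cgroup.conf
    ConstrainDevices=yes

    # gres.conf on each GPU node (example device paths)
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

With ConstrainDevices=yes, the devices cgroup should only grant a job access to the GPUs it was actually allocated, so a --gres=gpu:1 job can't touch the others.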
There was a bug in a much earlier version of Slurm that corrupted the database in a way that the cgroups/accounting code could no longer fence GPUs. This was fixed in a later version, but the database corruption carries forward. Apparently the db can be fixed manually, but we're just starting over with a new install and a fresh db.

On Tue, Aug 25, 2020 at 11:03 AM Ryan Novosielski <novos...@rutgers.edu> wrote:

> Sorry about that. “NJT” should have read “but”; apparently my phone
> decided I was talking about our local transit authority. 😓
>
> On Aug 25, 2020, at 10:30, Ryan Novosielski <novos...@rutgers.edu> wrote:
>
> I believe that’s done via a QoS on the partition. Have a look at the
> docs there, and I think “require” is a good keyword to look for.
>
> Cgroups should also help with this, NJT I’ve been troubleshooting a
> problem where that seems not to be working correctly.
>
> --
>  ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
> On Aug 25, 2020, at 10:13, Willy Markuske <wmarku...@sdsc.edu> wrote:
>
> Hello,
>
> I'm trying to restrict access to GPU resources on a cluster I maintain
> for a research group. There are two nodes in a partition with gres gpu
> resources defined. Users can access these resources by submitting their
> job to the gpu partition and specifying a gres=gpu.
>
> When a user includes the flag --gres=gpu:# they are allocated that number
> of GPUs and Slurm allocates them properly. If a user requests only 1 GPU,
> they see only CUDA_VISIBLE_DEVICES=1. However, if a user does not include
> the --gres=gpu:# flag, they can still submit a job to the partition and
> are then able to see all the GPUs. This has led to some bad actors
> running jobs on all the GPUs, including ones other users have allocated,
> causing OOM errors on the GPUs.
>
> Is it possible, and where would I find the documentation on doing so, to
> require users to specify --gres=gpu:# to be able to submit to a
> partition? So far, reading the gres documentation doesn't seem to have
> yielded anything on this issue specifically.
>
> Regards,
>
> --
> Willy Markuske
> HPC Systems Engineer
> Research Data Services
> P: (858) 246-5593
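For the original question about forcing jobs in that partition to request a GPU: the QoS route Ryan mentioned is, roughly, a QOS with a minimum TRES per job attached to the partition. A sketch, with the QOS and node names made up for the example; I believe AccountingStorageEnforce in slurm.conf also needs to include limits (and qos) for the minimum to actually be enforced:

    # create a QOS that requires at least one GPU per job and rejects
    # non-conforming jobs at submit time instead of leaving them pending
    sacctmgr add qos gpuonly
    sacctmgr modify qos gpuonly set MinTRESPerJob=gres/gpu=1 Flags=DenyOnLimit

    # slurm.conf: attach the QOS to the GPU partition
    PartitionName=gpu Nodes=gpu-[01-02] QOS=gpuonly

Jobs submitted to the gpu partition without --gres=gpu:# should then be rejected at submit time rather than landing on a node and seeing every GPU.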