Sorry about that. “NJT” should have read “but”; apparently my phone decided I was talking about our local transit authority. 😓

On Aug 25, 2020, at 10:30, Ryan Novosielski <novos...@rutgers.edu> wrote:

 I believe that’s done via a QoS on the partition. Have a look at the docs there; I think “require” is a good keyword to look for.
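
Roughly, what I have in mind is something like the following (an untested sketch; the exact option names are worth checking against the sacctmgr man page and the resource_limits docs for your Slurm version, and it assumes accounting limits are enforced, e.g. AccountingStorageEnforce includes qos and limits; the QOS, partition, and node names below are placeholders):

    # Create a QOS with a minimum GRES-per-job requirement; the DenyOnLimit
    # flag should make non-conforming jobs be rejected at submit time
    # rather than left pending.
    sacctmgr add qos gpu-required
    sacctmgr modify qos gpu-required set MinTRESPerJob=gres/gpu=1 Flags=DenyOnLimit

    # slurm.conf: attach that QOS to the GPU partition.
    PartitionName=gpu Nodes=gpu[01-02] QOS=gpu-required State=UP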

Cgroups should also help with this, NJT I’ve been troubleshooting a problem 
where that seems not to be working correctly.
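
For reference, the cgroup setup I mean is along these lines (a sketch of the usual pieces, not necessarily your exact config); with device constraining enabled, a job that didn’t request a GPU shouldn’t even be able to open the GPU device files when it lands on a GPU node:

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf
    ConstrainDevices=yes    # limit /dev/nvidia* access to jobs that requested the GRES
    ConstrainCores=yes
    ConstrainRAMSpace=yes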

--
____
|| \\UTGERS,      |---------------------------*O*---------------------------
||_// the State   |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
     `'

On Aug 25, 2020, at 10:13, Willy Markuske <wmarku...@sdsc.edu> wrote:



Hello,

I'm trying to restrict access to GPU resources on a cluster I maintain for a research group. There are two nodes in a partition with GRES GPU resources defined. Users can access these resources by submitting their jobs to the gpu partition and specifying a --gres=gpu request.
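
For reference, the relevant configuration looks roughly like the following (node names and GPU counts below are placeholders rather than our exact values):

    # slurm.conf
    GresTypes=gpu
    NodeName=gpu[01-02] Gres=gpu:4 ...
    PartitionName=gpu Nodes=gpu[01-02] Default=NO State=UP

    # gres.conf on the GPU nodes
    Name=gpu File=/dev/nvidia[0-3]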

When a user includes the --gres=gpu:# flag, Slurm properly allocates that number of GPUs to the job; a user who requests only one GPU sees only CUDA_VISIBLE_DEVICES=1. However, if a user does not include the --gres=gpu:# flag, they can still submit a job to the partition and are then able to see all of the GPUs. This has led to some bad actors running jobs on all the GPUs, including ones other users have been allocated, and causing OOM errors on those GPUs.
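
For example (the job script name is made up, just to illustrate the two cases):

    sbatch -p gpu --gres=gpu:1 job.sh   # confined: the job sees a single device via CUDA_VISIBLE_DEVICES
    sbatch -p gpu job.sh                # no GRES requested, but the job still starts on a GPU node
                                        # and can see (and use) every GPU on it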

Is it possible to require users to specify --gres=gpu:# in order to submit to a partition, and where would I find the documentation on doing so? So far, reading the GRES documentation doesn't seem to have yielded anything on this issue specifically.

Regards,

--

Willy Markuske

HPC Systems Engineer


Research Data Services

P: (858) 246-5593
