Hello,

We're using cgroups to restrict access to the GPUs.

What I found particularly helpful are the slides by Marshall Garey from last year's Slurm User Group Meeting: https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf (NVML didn't work for us for some reason I can't recall, but listing the GPU device files explicitly was not a big deal).
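
In case it helps, the pieces that matter look roughly like the following (node names and device paths are just examples, and the details may differ for your Slurm version):

    # slurm.conf -- task/cgroup must be active for device constraining
    TaskPlugin=task/cgroup

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainDevices=yes

    # gres.conf -- GPU device files listed explicitly instead of AutoDetect=nvml
    NodeName=gpunode[01-02] Name=gpu File=/dev/nvidia0
    NodeName=gpunode[01-02] Name=gpu File=/dev/nvidia1

With ConstrainDevices=yes, the task/cgroup plugin only grants a job access to the device files of the GPUs it was actually allocated, so a job submitted without any --gres=gpu request simply doesn't see the GPUs.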

Best,
Christoph


On 25/08/2020 16.12, Willy Markuske wrote:
Hello,

I'm trying to restrict access to GPU resources on a cluster I maintain for a research group. There are two nodes in a partition with GRES gpu resources defined. Users can access these resources by submitting their jobs to the gpu partition and specifying a gres=gpu request.

When a user includes the flag --gres=gpu:#, Slurm properly allocates the requested number of GPUs; if a user requests only 1 GPU, they see only CUDA_VISIBLE_DEVICES=1. However, if a user does not include the --gres=gpu:# flag, they can still submit a job to the partition and are then able to see all the GPUs. This has led to some bad actors running jobs on GPUs that other users have allocated and causing out-of-memory errors on those GPUs.
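
For example (the job script name is just a placeholder):

    sbatch -p gpu --gres=gpu:1 job.sh    # job is limited to its one allocated GPU
    sbatch -p gpu job.sh                 # no GRES request, yet the job sees every GPU on the node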

Is it possible to require users to specify --gres=gpu:# in order to submit to a partition, and where would I find the documentation on doing so? So far, reading the GRES documentation doesn't seem to have yielded anything on this issue specifically.

Regards,

--

Willy Markuske

HPC Systems Engineer


Research Data Services

P: (858) 246-5593


--
Dr. Christoph Brüning
Universität Würzburg
Rechenzentrum
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499
