Hello,

I'm trying to restrict access to GPU resources on a cluster I maintain
for a research group. There are two nodes in a partition with GRES GPU
resources defined. Users can access these resources by submitting their
jobs to the gpu partition and specifying --gres=gpu.
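
For reference, the setup is along these lines (node names and GPU
counts below are illustrative, not our actual values):

    # slurm.conf
    GresTypes=gpu
    NodeName=gpu-node-[01-02] Gres=gpu:4 ...
    PartitionName=gpu Nodes=gpu-node-[01-02] ...

    # gres.conf on each node
    NodeName=gpu-node-[01-02] Name=gpu File=/dev/nvidia[0-3]

and users submit with something like:

    sbatch -p gpu --gres=gpu:1 job.sh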

When a user includes the flag --gres=gpu:#, Slurm properly allocates
the requested number of GPUs, and a user who requests only one GPU sees
only that single device (e.g. CUDA_VISIBLE_DEVICES=1). However, if a
user does not include the --gres=gpu:# flag, they can still submit a
job to the partition and are then able to see all the GPUs. This has
led to some bad actors running jobs across GPUs that other users have
been allocated, causing out-of-memory (OOM) errors on those GPUs.
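
To illustrate the two cases (device indices and GPU counts here are
just examples of what we observe):

    # With an explicit GRES request, only the allocated device is exposed:
    $ srun -p gpu --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'
    1

    # Without --gres, the job is still accepted, CUDA_VISIBLE_DEVICES is
    # never set, and the processes can see every GPU on the node:
    $ srun -p gpu bash -c 'nvidia-smi -L | wc -l'
    4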

Is it possible to require users to specify --gres=gpu:# in order to
submit to a partition, and where would I find the documentation for
doing so? So far, reading the GRES documentation doesn't seem to have
yielded anything on this issue specifically.
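
To frame what I'm after: ideally a submission without a GRES request
would simply be rejected at submit time, something like the following
(the error message is made up purely to show the intent):

    $ sbatch -p gpu job.sh
    sbatch: error: job rejected: the gpu partition requires --gres=gpu:<count>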

Regards,

-- 

Willy Markuske
HPC Systems Engineer
Research Data Services
P: (858) 246-5593
