BLUF:
     Is the NVIDIA MPS service required for the MPS GRES to function in Slurm 
with multiple GPUs in a single machine? (Jobs using MPS don't need to span 
GPUs, just use a part of a single GPU in a machine with multiple GPUs.)
     Is there more detailed documentation available on how MPS should be set up 
and how it functions?


I'm playing with MPS on a test machine, and the documentation at 
https://slurm.schedmd.com/gres.html seems a bit vague. It implies MPS can be 
used across multiple GPUs, but then states that only one GPU per node may be 
configured for use with MPS.

When I test MPS in Slurm without the NVIDIA MPS service (I am just starting to 
read up on the NVIDIA MPS service now), it does seem to use only one GPU.
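
The way I've been checking which device a step lands on is roughly the 
following (if I'm reading gres.html right, Slurm should set 
CUDA_VISIBLE_DEVICES and, for MPS jobs, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE in 
the step environment):

     # Rough check of what Slurm hands an MPS step on testmachine1:
     srun -w testmachine1 --gres=mps:50 \
          env | grep -E 'CUDA_VISIBLE_DEVICES|CUDA_MPS'
     # Every step I start this way reports CUDA_VISIBLE_DEVICES=0,
     # never the second GPU.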

In gres.conf:
     NodeName=testmachine1 Name=gpu File=/dev/nvidia[0-1]
     NodeName=testmachine1 Name=mps Count=200 File=/dev/nvidia[0-1]

In slurm.conf:
     NodeName=testmachine1 Gres=gpu:2,mps:200 Sockets=1 CoresPerSocket=6
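
For what it's worth, I haven't started the NVIDIA control daemon at all yet. 
From what I've read so far in NVIDIA's MPS documentation, starting it by hand 
would look roughly like this (the directories here are just examples):

     # Start the NVIDIA MPS control daemon (run as the user that
     # owns the GPUs, e.g. root or a dedicated MPS account):
     export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
     export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
     nvidia-cuda-mps-control -d
     # And to shut it down again:
     echo quit | nvidia-cuda-mps-control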

An array job submitted with "--gres=mps:50" will put two job steps on the 
first GPU, but doesn't use the second GPU for MPS jobs.
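
Concretely, the submission looked something like this (the script name and 
array range are just placeholders):

     # Array tasks, each asking for a quarter of the node's MPS count:
     sbatch --array=1-4 --gres=mps:50 test_job.sh
     # Two tasks land on /dev/nvidia0; nothing is ever scheduled
     # onto /dev/nvidia1.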

Is the NVIDIA MPS service required for the MPS GRES to function in Slurm?
Is there more detailed documentation available on how MPS should be set up and 
how it functions?

We have a mixed set of work (shared-GPU jobs using 1 CPU core and a small 
percentage of one GPU, versus dedicated-GPU jobs using a whole number of GPUs 
and CPUs) on machines with 4 GPUs, and it would be nice to have the two kinds 
co-exist instead of splitting the machines into two separate partitions for 
the two styles of jobs.
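
To make the mix concrete, the two submission styles we'd like to land on the 
same nodes look roughly like this (script names and sizes are placeholders):

     # Shared style: one core plus a small slice of one GPU via MPS:
     sbatch -c 1 --gres=mps:10 shared_job.sh
     # Dedicated style: whole GPUs and a matching block of cores:
     sbatch -c 4 --gres=gpu:2 dedicated_job.sh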

Thanks.
