Hello,

I'm sure that this question has been asked before. We have recently added some 
GPU nodes to our SLURM cluster. 

 - 10 nodes, each providing 2 x Tesla V100-PCIE-16GB cards
 - 10 nodes, each providing 4 x GeForce GTX 1080 Ti cards

I'm aware that the simplest way to manage these resources is probably to 
set up one or two dedicated partitions, with users then taking exclusive 
access to whole nodes.
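
For what it's worth, this is roughly what I have in mind for that simple 
approach (the node and partition names below are placeholders, not our 
real ones):

    # slurm.conf (sketch) -- one partition per card type, whole nodes only
    # OverSubscribe=EXCLUSIVE (Shared=EXCLUSIVE on older SLURM) gives each
    # job a whole node to itself
    PartitionName=v100   Nodes=gpu-v100-[01-10]   OverSubscribe=EXCLUSIVE State=UP
    PartitionName=1080ti Nodes=gpu-1080ti-[01-10] OverSubscribe=EXCLUSIVE State=UP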

Alternatively, I suspect it's possible to manage all of these nodes in a 
single partition and let users run multiple jobs per node (say, when a job 
only needs one GPU card). My guess is that we would then need a gres.conf 
on each of the GPU nodes, and perhaps also have users pick the card type 
via the SLURM "feature"/constraint option. Presumably the gres.conf could 
declare the number and type of GPU cards on each node, so that users could 
request the number and type of GPUs directly through GRES, without needing 
the feature option at all.
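
To make that more concrete, below is the sort of thing I was imagining; 
the node names and /dev paths are guesses on my part and untested:

    # slurm.conf (sketch)
    GresTypes=gpu
    NodeName=gpu-v100-[01-10]    Gres=gpu:v100:2   Feature=v100
    NodeName=gpu-1080ti-[01-10]  Gres=gpu:1080ti:4 Feature=1080ti
    PartitionName=gpu Nodes=gpu-v100-[01-10],gpu-1080ti-[01-10] State=UP

    # gres.conf on each V100 node
    Name=gpu Type=v100 File=/dev/nvidia[0-1]

    # gres.conf on each GTX 1080 Ti node
    Name=gpu Type=1080ti File=/dev/nvidia[0-3]

If I understand correctly, users could then request GPUs either by typed 
GRES or by feature:

    sbatch --gres=gpu:v100:1 job.sh        # one V100 via the typed GRES
    sbatch --gres=gpu:1 -C 1080ti job.sh   # one GPU, card type via the feature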

Also, I suspect we will not want the default QOS to apply to the GPU 
nodes. I'm not sure whether there is a clever way to attach certain 
per-user limits to the partition definition itself, rather than defining 
another QOS.
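
In case it helps, the only mechanism I have found so far for limits that 
apply just to the GPU partition is a partition QOS, along these lines (the 
limit values are made up):

    # Create a QOS carrying only the limits we want on GPU jobs
    sacctmgr add qos gpuqos
    sacctmgr modify qos gpuqos set MaxTRESPerUser=gres/gpu=4 MaxJobsPerUser=8

    # Attach it to the GPU partition in slurm.conf so it applies only there
    PartitionName=gpu Nodes=gpu-v100-[01-10],gpu-1080ti-[01-10] QOS=gpuqos State=UP

I gather AccountingStorageEnforce needs to include "limits" for that to be 
enforced, but I may well have that wrong, and it still feels like more 
machinery than I'd like.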

Any help or tips on getting the configuration started -- ideally keeping 
the user interface fairly simple -- would be much appreciated.

Best regards,
David
