Good morning.
I'm wondering if someone could point me in the right direction to fulfill a
request on one of our small clusters.

Cluster info:
 * 5 nodes, each with 4 GPUs and 28 CPUs.
 * One user will submit only to CPUs; the other 8 users will submit to GPUs.
 * Only one account in the database, with 9 users.

* All users should be able to run on all of the CPUs or GPUs in the cluster at
once *if* the queue is empty (max jobs per user: 20 for GPU jobs, 120 for CPU
jobs).
* If there is a wait queue, max jobs per user should drop to 10 for GPU
requests and 60 for CPU requests.
* The owner does NOT want limits on how many processors/GPUs a user can use at
a time.
* A user who has 10 jobs running and 100 in the wait queue should only have
another of their jobs start once one of their 10 running jobs has ended (i.e.
when one of their 10 jobs ends, the next of their own queued jobs starts
rather than another user's).

* My dilemma is the requirement that max jobs per user should be 20 on GPUs
and 120 on CPUs if the queue is empty, but 10 on GPUs and 60 on CPUs if the
queue is not empty.


I’ve gone round and round about which path I should go down (rough sketches of
what I mean are below):

  * Separate partitions (one for GPUs and one for CPUs), setting limits per
partition in slurm.conf
  * A QOS for the max job limits (a tiered set of QOSes such as low/normal/high
is not necessary; I would only want the QOS for the max limits)
  * One partition, letting fairshare handle this strictly, with something like:
TRESBillingWeights="CPU=1.0,Mem=0.25G,gres/gpu=20"
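
For the first two paths, this is roughly what I've been sketching: two
partitions over the same nodes, each with a partition QOS that carries the
per-user job cap. The names and numbers below are just placeholders, and it
only covers the "queue is not empty" caps, not the switch to 20/120 when the
queue is empty:

    # slurm.conf (sketch only; node/partition/QOS names are placeholders)
    # (assumes AccountingStorageEnforce includes "limits" so QOS caps apply)
    GresTypes=gpu
    NodeName=node[01-05] CPUs=28 Gres=gpu:4 State=UNKNOWN
    PartitionName=cpu Nodes=node[01-05] Default=YES State=UP QOS=cpu_limits
    PartitionName=gpu Nodes=node[01-05] Default=NO  State=UP QOS=gpu_limits

    # QOS objects holding the per-user caps, attached above as partition QOS
    sacctmgr add qos cpu_limits
    sacctmgr modify qos cpu_limits set MaxJobsPerUser=60
    sacctmgr add qos gpu_limits
    sacctmgr modify qos gpu_limits set MaxJobsPerUser=10

For the third path, I assume the billing weights would just sit on a single
partition covering everything, something like:

    PartitionName=all Nodes=node[01-05] Default=YES State=UP TRESBillingWeights="CPU=1.0,Mem=0.25G,gres/gpu=20"

Either way, I still don't see how to raise the caps to 20/120 automatically
when the queue is empty, which is the part I'm stuck on.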



Any advice on which path(s) to go down to get to this solution would be
greatly appreciated!
Jodie


Jodie Sprouse
Systems Administrator
Cornell University 
Center for Advanced Computing
Ithaca, NY 14850
jh...@cornell.edu
