Hello,

I recently changed our slurm.conf to allow job preemption. While making this
change, I also chose to use select/cons_res, to try to understand how
preemption will interact with our planned upgrade, which will add GPUs.
The code we will run on the GPUs can only use a single CPU, so I would like
each GPU node to run two jobs at once: one job would use the GPU and a
single CPU, and the other job would use the remaining CPUs. For instance, if
the node had 12 cores, it would run one job on the GPU and one CPU, and
another job on the remaining 11 cores. Those 11 cores should behave the same,
with regard to partitions and job preemption, as the full 16 cores on another
machine without a GPU.
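
To make the intent concrete, I imagine the two jobs on such a node being
submitted roughly like this (the script names and exact counts are just
placeholders, since the GPU nodes do not exist yet):

# GPU job: the GPU plus a single CPU
sbatch --gres=gpu:1 --ntasks=1 --cpus-per-task=1 gpu_job.sh
# CPU-only job: the remaining 11 of the 12 cores on the same node
sbatch --ntasks=1 --cpus-per-task=11 cpu_job.sh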

To me there appear to be two logical ways of thinking this through. The node
with the GPU could be viewed as two nodes, and each of these two nodes would
be assigned a job using select/linear as the SelectType. (I have not seen any
documentation describing how to do this, so I have not attempted it.) Or the
resources of the node are assigned individually through select/cons_res.
(The manual suggests using select/cons_tres for GPUs, but I am told that
plugin cannot be found, so for now I am attempting this with cons_res.)
However, select/cons_res introduces a new problem for job preemption, even on
the simpler nodes which should only run one job per node.
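
In case it helps to picture the second approach, I imagine the GPU nodes
being defined with Gres along these lines (node names, core counts, and the
gres.conf entry are hypothetical; none of this is in our slurm.conf yet):

GresTypes=gpu
NodeName=gpunode[0-1] CPUs=12 Gres=gpu:1 State=UNKNOWN
# and in gres.conf:
NodeName=gpunode[0-1] Name=gpu File=/dev/nvidia0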

It appears that select/cons_res behaves mostly the same as select/linear
except for node sharing. Job preemption mostly works as expected; however, in
the partition whose jobs do the preempting (hi_pri), multiple jobs are placed
on the same node. This partition has OverSubscribe=FORCE:1, so Slurm is not
oversubscribing the CPUs, but it is assigning fewer CPUs to each job than
intended (probably one CPU per job). A job can also be submitted with a
specified number of CPUs, or with its own OverSubscribe setting (but the
job's OverSubscribe setting is overruled by the partition's). I can set a
specific number of CPUs for the job, but that is not what is desired; instead
the job should use all CPUs besides those reserved for the GPU.
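
To illustrate, the jobs in that partition are submitted without any CPU
count, essentially just

sbatch --partition=hi_pri job.sh

and each one ends up with the default single CPU. What I would like is for
such a job to behave as if it had the whole node, without hard-coding
--ntasks=16 (or adding --exclusive) in every submission.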

Is it possible to apply select/linear to the portion of the cluster which
participates in preemption, and select/cons_tres to the GPU portion?
Can I use cons_res in such a way that a job requests all CPUs besides those
reserved for the GPU?
Is it possible to allocate the GPU using select/linear?

Here are the relevant portions of slurm.conf:

SchedulerType=sched/builtin
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
#
# Node Configurations
#
NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=6400 State=UNKNOWN
NodeName=devel CPUs=32 NodeAddr=workmaster NodeHostname=workmaster
NodeName=pegasus[0-1] CPUs=24 CoresPerSocket=12 ThreadsPerCore=2 NodeAddr=192.168.250.[5-6] NodeHostname=pegasus[0-1]
NodeName=steve[0-4] CPUs=16 CoresPerSocket=8 ThreadsPerCore=2 NodeAddr=192.168.250.[7-11] NodeHostname=steve[0-4]

#
# Partition Configurations
#
PartitionName=DEFAULT State=UP
PartitionName=low_pri Nodes=devel,pegasus[0-1],steve[0-4] Default=NO PreemptMode=REQUEUE PriorityTier=1 OverSubscribe=Exclusive
PartitionName=med_pri Nodes=devel,pegasus[0-1],steve[0-4] Default=YES PreemptMode=REQUEUE PriorityTier=2 OverSubscribe=Exclusive
PartitionName=hi_pri Nodes=devel,pegasus[0-1],steve[0-4] Default=NO PreemptMode=OFF PriorityTier=3 OverSubscribe=FORCE:1
