Hello!

We are having an issue with high-priority GPU jobs blocking low-priority CPU-only jobs.

Our cluster is set up with one partition, "all", and every node resides in that 
partition. The "all" partition contains four generations of compute nodes, 
including the GPU nodes. We do this to make use of the unused cores on the GPU 
nodes for compute-only jobs. Users select among the generations by specifying a 
constraint (if they care), and select the GPU nodes with --qos=gpu / 
--gres=gpu:tesla:1. The gpu QOS gives those jobs the highest priority in the 
queue, so that they get scheduled sooner onto the limited GPU resources we 
have. This has worked out really nicely for quite some time. But lately we've 
noticed that the GPU jobs are blocking the CPU-only jobs. Yes, the GPU jobs 
have higher priority, yet they can only run on a very small subset of nodes 
compared to the CPU-only jobs. It appears that Slurm isn't taking into 
consideration the limited set of nodes the GPU jobs can run on; that's the only 
explanation I see for the GPU jobs blocking the CPU-only jobs. I'm not sure if 
this is due to a recent Slurm change, or if we just never noticed, but it's 
definitely happening.
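For reference, the gpu QOS priority boost is typically configured with 
sacctmgr roughly like this (a sketch; the QOS name matches ours, but the 
Priority value here is an illustrative example, not our actual setting):

```shell
# Create the QOS and give it a large priority factor so jobs submitted
# with --qos=gpu sort to the top of the pending queue.
sacctmgr add qos gpu
sacctmgr modify qos gpu set Priority=10000

# Users then request it at submit time, e.g.:
#   sbatch --qos=gpu --gres=gpu:tesla:1 --constraint=broadwell job.sh
```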

For example, the behavior happens in the following scenario:

- 15 compute nodes (no GPUs) are idle
- All of the GPUs are occupied
- 1000s of low-priority compute-only jobs in the pending queue
- 100s of highest-priority GPU jobs in the pending queue

In that scenario, the low-priority jobs are neither backfilled nor started, yet 
compute-only nodes remain idle. If I hold the GPU jobs, the lower-priority 
compute-only jobs then start.
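For anyone who wants to reproduce the test, holding the pending GPU jobs looks 
roughly like this (a sketch, assuming the GPU jobs are all submitted under the 
gpu QOS):

```shell
# Hold every pending job in the gpu QOS.
for jobid in $(squeue --noheader --states=PENDING --qos=gpu --format=%i); do
    scontrol hold "$jobid"
done

# Within a scheduling cycle or two, the idle compute-only nodes fill up
# with the low-priority CPU jobs. Undo with:
#   scontrol release <jobid>
```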

Has anyone seen this? Am I thinking about this wrong? I would think that Slurm 
should not be considering the nodes with no GPUs when trying to fulfill the GPU 
jobs.

I have an idea for how to fix this scenario, but I think our current config 
should work. The fix I am mulling over is to create a gpu partition and place 
the GPU nodes into it. Then, use the all_partitions job_submit plugin to 
schedule compute-only jobs into both partitions; the GPU jobs would then only 
land in the gpu partition. I'd think that would definitely fix the issue, but 
maybe there is a downside. Still, I think our current setup should be working!?
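A minimal sketch of that fix as slurm.conf fragments (node names and GPU 
counts are placeholders, not our actual config; note that all_partitions only 
applies to jobs that don't explicitly request a partition):

```shell
# slurm.conf sketch -- hypothetical node names gpu[01-08]
# GPU nodes stay in "all" and also get their own partition:
PartitionName=all Nodes=ALL Default=YES
PartitionName=gpu Nodes=gpu[01-08]

# The all_partitions plugin sets a job's partition to every partition the
# user can access when none is requested, so compute-only jobs are
# eligible for both "all" and "gpu":
JobSubmitPlugins=all_partitions
```

GPU jobs would request --partition=gpu (or be routed there by a site 
job_submit rule), so their high priority could no longer pin down the 
compute-only nodes in "all".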

Thanks for your advice!

Best,
Chris 

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
