[slurm-users] QOS Verification and Management

Jason Simms Wed, 20 Jan 2021 12:17:32 -0800

Dear all,

I have two users on our cluster who "bought into" it, much like a condo
model, by purchasing one single physical node each. For those users, I have
attempted to configure two QOS levels, such that when they submit jobs and
invoke the QOS, they will have preemptive, priority access to resources up to
the amount provided in the nodes they purchased. When they are not using
those resources, I want them to be available to any user on the system.


Apparently, this is not working as designed. One user who should have had
priority was waiting for resources for over 10 minutes on a super small,
simple job; based on expected behavior, she should have preempted a running
job and more or less started immediately. Something might be wrong in my
configuration, and I'd also welcome other thoughts for how to test to
ensure the QOS is working properly. Here's what I have in slurm.conf:

PreemptType=preempt/qos
PreemptMode=REQUEUE

The list of QOS seems reasonable:

[simmsj@hpc ~]$ sacctmgr show qos format=name,priority
      Name   Priority
---------- ----------
    normal          0
hendricks+        100
douglaslab        100

And here is the sample job invocation that was stuck in the queue:

srun -t 45 --cpus-per-task=1 --mem-per-cpu=1gb --qos=douglaslab --pty /bin/bash

And here is how I created the QOS in the first place:

sacctmgr add qos douglaslab
sacctmgr modify qos douglaslab set priority=100
sacctmgr modify qos douglaslab set GrpCPUs=24  # because the node has 24 CPUs

Also, I verified that the appropriate users are in the QOS groups. In the end,
then, I assume that if members of the douglaslab QOS submit jobs that
require up to 24 CPUs, then they can preempt running jobs to get that many
if they are not immediately available. But this doesn't seem to be working,
as noted. Any advice would be welcome!
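
One way to test this assumption, offered as a rough sketch rather than a
confirmed diagnosis: with PreemptType=preempt/qos, a QOS can only preempt
jobs running under the QOS levels named in its Preempt list, and the creation
commands above do not set one. The helper below (qos_can_preempt is a
hypothetical name, not a Slurm tool) reads name|preempt pairs, in the form
printed by "sacctmgr -nP show qos format=name,preempt", and reports whether
one QOS lists another. It assumes the jobs to be preempted run under the
normal QOS.

```shell
# Hypothetical helper: succeeds (exit 0) if QOS $1 names QOS $2 in its
# Preempt list. Expects "name|preempt" lines on stdin, where the preempt
# field is a comma-separated list of QOS names (possibly empty).
qos_can_preempt() {
    awk -F'|' -v q="$1" -v victim="$2" '
        $1 == q {
            # Split the comma-separated Preempt field and look for the victim.
            n = split($2, list, ",")
            for (i = 1; i <= n; i++)
                if (list[i] == victim) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Intended use on the cluster (requires Slurm; shown as a comment so the
# function definition can be sourced anywhere):
#   sacctmgr -nP show qos format=name,preempt | qos_can_preempt douglaslab normal \
#       && echo "douglaslab may preempt normal" \
#       || echo "douglaslab has no Preempt entry for normal"
```

If the check fails, setting the list with "sacctmgr modify qos douglaslab set
preempt=normal" would be the obvious next thing to try, again assuming the
jobs to be displaced run under normal.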

Warmest regards,
Jason

-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632
