Here's how we handle this at our site:
Create a separate partition named debug that also contains that node.
Give the debug partition a very short time limit, say 30 to 60 minutes:
long enough for debugging, but too short to do any real work. Make the
priority of the debug partition much higher than t
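
As a rough illustration of the debug-partition idea, a slurm.conf fragment might look like the sketch below. The node names, the "batch" partition, and every value are hypothetical, and it assumes "priority" here means PriorityTier (PriorityJobFactor is another option, depending on how the site schedules).

```
# slurm.conf fragment (sketch only; names and values are assumptions)
# Regular partition that owns the node for normal work.
PartitionName=batch Nodes=node[01-04] MaxTime=7-00:00:00 PriorityTier=1 Default=YES State=UP
# debug overlaps batch on node01: short MaxTime, higher PriorityTier.
PartitionName=debug Nodes=node01 MaxTime=01:00:00 PriorityTier=10 State=UP
```

With something like this in place, users would submit short test runs with "sbatch -p debug ...", and anything needing more than the hour has to go to the regular partition.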
We put a ‘gpu’ QOS on all our GPU partitions, and limit jobs per user to 8 (our
GPU capacity) via MaxJobsPerUser. Extra jobs get blocked, allowing other users
to queue jobs ahead of the extras.
# sacctmgr show qos gpu format=name,maxjobspu
      Name MaxJobsPU
---------- ---------
       gpu         8
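
For anyone wanting to reproduce the QOS setup, a minimal sketch follows. The partition and node names are hypothetical, and it assumes the accounting database (slurmdbd) is in use and that limit enforcement is switched on; check your own AccountingStorageEnforce setting before copying anything.

```
# slurm.conf fragment (sketch only; names are assumptions)
# The QOS itself lives in the accounting database and would be created
# first with something like:
#   sacctmgr add qos gpu
#   sacctmgr modify qos gpu set MaxJobsPerUser=8
# QOS limits are only enforced when "limits" is included here.
AccountingStorageEnforce=limits,qos
# Attach the QOS to each GPU partition as a partition QOS.
PartitionName=gpu Nodes=gpu[01-02] QOS=gpu State=UP
```

Because the limit is attached to the partition QOS, it applies to every job in those partitions no matter which QOS the user requested for the job.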