Hey everyone,
Perhaps I am asking a basic question, but I really don't understand how preemption works.
The scenario (simplified for the example) is like this:

Nodes:
NodeName=A1  CPUS=2 RealMemory=128906 TmpDisk=117172
NodeName=A2  CPUS=30 RealMemory=128906 TmpDisk=117172 Gres=gpu:3

Partitions:
PartitionName=lab1 Nodes=A2 QOS=lab Default=No State=UP
PartitionName=all Nodes=A2,A1 QOS=normal Default=Yes State=UP

Users:
u1: qos=lab
u2: qos=normal

Commands (in this order):
u2: srun  --gres=gpu:2 --pty bash
u1: srun  --gres=gpu:2 --pty bash

Result:
squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R %Q"

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) PRIORITY
               318      lab1     bash       u1 PD       0:00      1 (Resources)      101177
               317       all     bash       u2  R       0:21      1 A2               20


As you can see, u1 didn't get his resources because (I believe) a QOS cannot preempt jobs from another QOS that run in a different partition, even though they use the same resources.

How should I configure the cluster so that all users with a specific QOS (lab) can suspend jobs in all other QOSes (not lab) on a specific partition (lab1)?



sacctmgr show qos
Name     Priority  GraceTime  Preempt  PreemptMode
lab1     1000      00:01:00   normal   suspend
normal   0         00:00:00


slurm.conf:


PreemptType=preempt/qos
PreemptMode=suspend,gang

PriorityType=priority/multifactor
PriorityDecayHalfLife=30-0
PriorityMaxAge=10000
PriorityWeightFairshare=10000
PriorityWeightQOS=100000

AccountingStorageEnforce=associations,limits,qos



Thanks in advance, Nadav
