Hi, I am struggling to configure job preemption. I have nodes with 8 Nvidia A100 GPUs. I have two partitions: short (lower priority) and sfglab (higher priority). I want higher-priority jobs to preempt lower-priority jobs (REQUEUE mode). It appears to work, but it works too well.
A job from the higher-priority partition preempts every job on the host, instead of only the single job that would be enough to free the resources it needs. What's more, it locks the rest of the resources until the high-priority job ends. What am I doing wrong?
Here is an example:

$ srun --test-only -G1 -c1 --mem 1M -p sfglab
srun: Job 501151 to start at 2023-01-17T12:46:01 using 1 processors on nodes dgx-1 in partition sfglab
srun: Preempts: 363278,501001,501029,501075,501076,501077,501120,501121

To release these resources, it would be enough to preempt one job instead of all of them.
Here is my config:

slurm.conf
(...)
DefMemPerCPU = 100
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
PreemptMode = REQUEUE
PreemptType = preempt/partition_prio
PreemptExemptTime = 00:00:00
SelectType = select/cons_tres
SelectTypeParameters = CR_CORE_MEMORY
(...)
PartitionName=short Nodes=dgx-[1-4],sr-[1-3] MaxTime=1-0 State=UP PriorityTier=10000 Default=YES DefaultTime=0-01:00:00 OverSubscribe=NO PreemptMode=requeue
PartitionName=sfglab Nodes=dgx-1 MaxTime=10-0 State=UP PriorityTier=20000 PreemptMode=off OverSubscribe=NO AllowAccounts=sfglab

--
best regards | pozdrawiam serdecznie
*Michał Kadlof*
Head of the high performance computing center
Eden^N cluster administrator
Faculty of Mathematics and Computer Science
Warsaw University of Technology