Hi,

I am struggling to configure job preemption. I have nodes with 8 NVIDIA A100 GPUs each, and two partitions: short (lower priority) and sfglab (higher priority). I want higher-priority jobs to be able to preempt lower-priority jobs in REQUEUE mode. It appears to work, but it works too well.

A job from the higher-priority partition preempts every job on the host, instead of only the single job whose removal would free enough resources for it. What's more, it locks the rest of the node's resources until the high-priority job ends. What am I doing wrong?

Here is an example:

$ srun --test-only -G1 -c1 --mem 1M -p sfglab
srun: Job 501151 to start at 2023-01-17T12:46:01 using 1 processors on nodes dgx-1 in partition sfglab
srun:   Preempts: 363278,501001,501029,501075,501076,501077,501120,501121

To release the resources this job needs (1 GPU, 1 CPU, 1 MB of memory), it would be enough to preempt one job instead of all eight.
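
To make that expectation concrete, here is a minimal Python sketch of the selection I have in mind. It is not Slurm's actual preemption logic, only an illustration: the function name, the per-job allocations (one GPU and one CPU each) and the free CPU count are assumptions made up for the example; only the job IDs come from the output above.

```python
from itertools import combinations


def minimal_preemption_set(needed, free, running):
    """Return the smallest tuple of job IDs whose release satisfies `needed`.

    needed, free: dicts mapping resource name -> amount.
    running: dict mapping job ID -> dict of resources held by that job.
    Brute force over subsets, which is fine for a handful of jobs per node.
    """
    resources = needed.keys()
    for size in range(len(running) + 1):
        for jobs in combinations(sorted(running), size):
            available = {
                r: free.get(r, 0) + sum(running[j].get(r, 0) for j in jobs)
                for r in resources
            }
            if all(available[r] >= needed[r] for r in resources):
                return jobs
    return None  # request cannot fit even if every job is preempted


# Hypothetical allocations: eight one-GPU jobs filling an 8-GPU node.
job_ids = [363278, 501001, 501029, 501075, 501076, 501077, 501120, 501121]
running = {j: {"gpu": 1, "cpu": 1} for j in job_ids}
free = {"gpu": 0, "cpu": 120}    # assumed free resources on dgx-1
needed = {"gpu": 1, "cpu": 1}    # the request from the srun line above

print(minimal_preemption_set(needed, free, running))
```

Under these assumed allocations the sketch selects a single job, which is the behaviour I would expect from the scheduler as well.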


Here is my config:

slurm.conf

(...)

DefMemPerCPU            = 100
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
PreemptMode             = REQUEUE
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CORE_MEMORY

(...)

PartitionName=short Nodes=dgx-[1-4],sr-[1-3] MaxTime=1-0 State=UP PriorityTier=10000 Default=YES DefaultTime=0-01:00:00 OverSubscribe=NO PreemptMode=requeue

PartitionName=sfglab Nodes=dgx-1 MaxTime=10-0 State=UP PriorityTier=20000 PreemptMode=off OverSubscribe=NO AllowAccounts=sfglab

--
best regards | pozdrawiam serdecznie
*Michał Kadlof*
Head of the high performance computing center
Eden^N cluster administrator
Faculty of Mathematics and Computer Science
Warsaw University of Technology
