Hello, I'm quite new to Slurm. I recently set up a small Slurm instance to manage our GPU resources, and I have run into the following situation:
JOBID  STATE    TIME        ACCOUNT   PARTITION  PRIORITY  REASON     CPU  MIN_MEM  TRES_PER_NODE
1739   PENDING  0:00        standard  gpu-low    5         Priority   1    80G      gres:gpu:a100_1g.10gb:1
1738   PENDING  0:00        standard  gpu-low    5         Priority   1    80G      gres:gpu:a100-sxm4-80gb:1
1737   PENDING  0:00        standard  gpu-low    5         Priority   1    80G      gres:gpu:a100-sxm4-80gb:1
1736   PENDING  0:00        standard  gpu-low    5         Resources  1    80G      gres:gpu:a100-sxm4-80gb:1
1740   PENDING  0:00        standard  gpu-low    1         Priority   1    8G       gres:gpu:a100_3g.39gb
1735   PENDING  0:00        standard  gpu-low    1         Priority   8    64G      gres:gpu:a100-sxm4-80gb:1
1596   RUNNING  1-13:26:45  standard  gpu-low    3         None       2    64G      gres:gpu:a100_1g.10gb:1
1653   RUNNING  21:09:52    standard  gpu-low    2         None       1    16G      gres:gpu:1
1734   RUNNING  59:52       standard  gpu-low    1         None       8    64G      gres:gpu:a100-sxm4-80gb:1
1733   RUNNING  1:01:54     standard  gpu-low    1         None       8    64G      gres:gpu:a100-sxm4-80gb:1
1732   RUNNING  1:02:39     standard  gpu-low    1         None       8    40G      gres:gpu:a100-sxm4-80gb:1
1731   RUNNING  1:08:28     standard  gpu-low    1         None       8    40G      gres:gpu:a100-sxm4-80gb:1
1718   RUNNING  10:16:40    standard  gpu-low    1         None       2    8G       gres:gpu:v100
1630   RUNNING  1-00:21:21  standard  gpu-low    1         None       1    30G      gres:gpu:a100_3g.39gb
1610   RUNNING  1-09:53:23  standard  gpu-low    1         None       2    8G       gres:gpu:v100

Job 1736 is in the PENDING state because there are no more a100-sxm4-80gb GPUs available. Its priority has been rising with time (it is now 5), as expected. Now another user submits job 1739, which requests an a100_1g.10gb GPU (gres:gpu:a100_1g.10gb:1) that is available, but the job is not starting since its priority is 1. This is obviously not the desired outcome, and I believe I need to change the scheduling strategy. Could someone with more experience than me give me some hints?

Thanks,
Cristiano
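P.S. From reading the documentation, my guess is that I need the backfill scheduler (sched/backfill) together with the multifactor priority plugin, so that a lower-priority job like 1739 can start on an idle GPU instead of waiting behind 1736. This is a rough sketch of what I had in mind for slurm.conf; the parameter values below are only guesses on my part, not our current configuration:

# slurm.conf (sketch only -- values are illustrative guesses, not our current settings)
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=30,bf_window=1440,bf_max_job_test=500
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityMaxAge=7-0
# Backfill can only start a lower-priority job early if it knows the job will
# end before the blocked job's expected start time, so jobs need finite time
# limits (e.g. DefaultTime=/MaxTime= on the partition, or sbatch --time).

If I understand correctly, with backfill enabled job 1739 should be started on the free a100_1g.10gb slice as long as that does not delay the expected start of 1736. Is this the right direction, or would you recommend a different strategy?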