Hello all,

We're mostly a GPU compute shop, and we've been happy with Slurm for the last three years, but we think it would benefit from the following two features:

1. Allow preemption within the same QOS, all else being equal, based on job priority.

2. Have the job size calculation take into account the number of GPUs allocated to the job; in a GPU cluster, the most valuable currency is the GPU, not the CPU. Perhaps even parameterize the job size factor so the user could choose what to emphasize in the calculation: CPU, GPU, or memory (a rough sketch of what we mean follows below).
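
To make #2 concrete, here is a rough Python sketch of the kind of parameterized job size factor we have in mind. It is purely our own illustration, not existing Slurm code; the names and the weighting scheme are assumptions.

    # Sketch of a parameterized job size factor (our illustration, not Slurm code).
    # The per-resource weights let the user choose what to emphasize.
    def job_size_factor(requested, total, weights):
        """requested/total: dicts keyed by 'cpu', 'gpu', 'mem'; weights sum to 1.0."""
        return sum(weights[r] * requested[r] / total[r]
                   for r in ('cpu', 'gpu', 'mem'))

    # Example: a GPU-heavy emphasis.
    weights = {'cpu': 0.1,  'gpu': 0.8, 'mem': 0.1}
    job     = {'cpu': 32,   'gpu': 8,   'mem': 256}    # requested by the job
    cluster = {'cpu': 1024, 'gpu': 64,  'mem': 8192}   # cluster totals
    print(job_size_factor(job, cluster, weights))      # ~0.106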


If this is not the right place for such requests, I would appreciate a pointer in the right direction.


Justification:

It's pretty obvious why we'd like #2.

We want #1 because we believe it would allow a more natural maximization of cluster usage. A user X could grab the whole cluster while it is free, and another user Y, arriving later, could still get jobs in by preempting some of X's jobs. We are assuming that X's fairshare score decreases as resources are consumed, so Y's jobs end up with higher priority, and that requeue, checkpoint and restart are employed. We also think this would make the system fairer in the long term, essentially time-slicing usage through priority-based preemption.
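
To illustrate the rule we are asking for in #1, here is a minimal sketch of the decision we have in mind (again our own illustration, not a patch; the Job type is hypothetical, and the priority is assumed to be the multifactor priority with fairshare included):

    from dataclasses import dataclass

    @dataclass
    class Job:
        qos: str
        priority: int  # assumed multifactor priority, fairshare included

    def can_preempt(pending: Job, running: Job) -> bool:
        # Within the same QOS, all else being equal, only priority decides.
        return pending.qos == running.qos and pending.priority > running.priority

    # Y's later job, boosted by fairshare, could preempt one of X's running jobs:
    print(can_preempt(Job('normal', 12000), Job('normal', 9000)))  # True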

Relu

