Hi Rodrigo,
We had indeed overlooked this. The problem is that in general our jobs
need more than 2 days of resources, which is why we set the wall time in
the batch scripts to the maximum wall time allowed by the partition.
One thing we could try is to set the wall time to ~46h for the "light"
jobs in the batch scripts and leave 48h for the "heavy" jobs, so that
not all jobs have the same time limit.
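Concretely, in the batch scripts that would just be something like (a
rough sketch, using the values I mention above):

    # "light" jobs: a bit below the partition maximum
    #SBATCH --time=46:00:00

    # "heavy" jobs: the full 48h partition maximum
    #SBATCH --time=48:00:00

so that the backfill scheduler actually sees different time limits.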
Configuring a node list for "light" and "heavy" jobs could do the trick
(sketched below). Two things that could be a problem then are (i) even
"heavy" jobs with very low priority would get access to resources at
the expense of "light" jobs with higher priority, and (ii) regular
manual intervention would be needed. But maybe there is no other
solution.
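To be concrete about what I mean (just a sketch, the node names and
counts are made up):

    # in the "light" job scripts: keep two nodes free for the "heavy" jobs
    #SBATCH --exclude=node[01-02]

    # in a 2-node "heavy" job script: request exactly those two nodes
    #SBATCH --nodes=2
    #SBATCH --nodelist=node[01-02]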
Thanks a lot for your input!
Best,
Jeremy
On 2022-01-19 01:46, Rodrigo Santibáñez wrote:
Hi Jeremy,
If all jobs have the same time limit, backfill is impossible. The
documentation says: "Effectiveness of backfill scheduling is dependent
upon users specifying job time limits, otherwise all jobs will have the
same time limit and backfilling is impossible". I don't know how to
overcome that...
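You can check what is currently configured on your side with something
like:

    # scheduler type and the backfill (bf_*) parameters
    scontrol show config | grep -i -E 'schedulertype|schedulerparameters'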
However, without changing SchedulerType, you could hold the pending
jobs except for the job you want to execute, then release all jobs once
the desired job is allocated. Alternatively, you could define a node or
list of nodes available to all jobs while excluding those nodes for the
job of interest, then remove that configuration once the latter is
allocated. I preferred the second option because both the "heavy" job
and the "light" jobs get allocated, and I don't have to watch the queue
outside office hours (again, this is easier to do on a lightly utilized
cluster).
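In practice it is just a few scontrol calls, roughly like this (job ids
and node names are placeholders):

    # first option: hold the pending jobs, release them once the
    # "heavy" job is allocated
    scontrol hold <jobid_list>
    scontrol release <jobid_list>

    # second option: exclude the nodes meant for the "heavy" job from
    # the pending "light" jobs
    scontrol update JobId=<jobid> ExcNodeList=node[01-02]
    # clear it again afterwards with a blank value
    scontrol update JobId=<jobid> ExcNodeList=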
About "PLANNED", I wasn't aware, and it is a feature of SLURM 21.08.
Could be that why you don't see it in your cluster?
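A quick way to check:

    # Slurm version on the cluster
    sinfo -V
    # node names and their (long) state; look for "planned"
    sinfo -o "%N %T"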
Best,
On Mon, Jan 17, 2022 at 2:02 PM Jérémy Lapierre
<jeremy.lapie...@uni-saarland.de> wrote:
Hi Rodrigo and Rémi,
I had a similar behavior a long time ago, and I decided to set
SchedulerType=sched/builtin to empty X nodes of jobs and execute that
high-priority job requesting more than one node. It is not ideal, but
the cluster has a low load, so a user who requests more than one node
doesn't delay the execution of others' jobs too much.
I don't think this would be ideal in our case as we have heavy loads.
Also, I'm not sure if you mean that we should switch to
SchedulerType=sched/builtin permanently or just for the time needed for
the problematic jobs to be allocated? Also, from our experience on
another cluster, we think Slurm should normally reserve resources.
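If it were only temporary, I guess it would mean something like this on
the controller side (just my understanding of it, we have not tried it,
and it assumes slurmctld runs under systemd):

    # in slurm.conf, temporarily switch:
    #   SchedulerType=sched/builtin    (instead of sched/backfill)
    # a scheduler change needs a slurmctld restart, not just
    # "scontrol reconfigure"
    systemctl restart slurmctld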
Backfilling doesn't delay the scheduled start time of higher priority
jobs, but at least they must have a scheduled start time.
Did you check the start time of your job pending with the Resources
reason? e.g. with `scontrol show job <id> | grep StartTime`.
Yes, the scheduled start time has been checked as well, and this time
is updated over time such that jobs asking for 1/4 of a node can run on
a freshly freed quarter of a node. This is why I'm saying that the jobs
asking for several nodes (tested with 2 nodes here) are pending
forever. It is like Slurm never wants to have unused resources (which
also makes sense, but how can we satisfy "heavy" resource requests
then?). On another cluster using Slurm, I know that Slurm reserves
nodes and the state of those reserved nodes becomes "PLANNED" (or
plnd); this way, jobs requesting more resources than are available at
the time of submission can later be satisfied. This never happens on
the cluster which is causing issues.
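For what it is worth, this is the kind of thing we keep re-checking
(the job id is just an example):

    # expected start time and pending reason of the 2-node job
    squeue --start -j 123456
    scontrol show job 123456 | grep -E 'StartTime|Reason'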
Sometimes Slurm is unable to determine the start time of a pending job.
One typical reason is the absence of a time limit on the running jobs.
In this case Slurm is unable to determine when the running jobs are
over, when the next highest priority job can start, and eventually
whether lower priority jobs actually delay higher priority jobs.
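You can quickly check that with something like:

    # time limit (%l) and remaining time (%L) of the running jobs
    squeue -t RUNNING -o "%.10i %.10u %.12l %.12L"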
Yes, we always set the time limit of our jobs to the max time limit
allowed by the partition.
Thanks for your help,
Jeremy