Re: [slurm-users] Excessive use of backfill on a cluster

Bjørn-Helge Mevik Tue, 20 Nov 2018 05:53:34 -0800

It might be unrelated, but I remember we had some similar problems when
setting up a new cluster two years ago.  I don't remember the details,
but I believe it was related to QOSes overriding partition limits.
Jobs in these QOSes (with requests that violated a partition limit, such
as the minimum number of nodes) were started fine by the backfiller, but
not by the scheduler.  It turned out that the check for this was correct
in the backfiller, but buggy in the scheduler.  The scheduler bug was
fixed in this particular case, but you might be hitting something
similar.
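
To illustrate the kind of discrepancy meant here, a purely hypothetical
sketch (not Slurm's actual code; every name below is invented for the
example) of the same limit check living in two code paths, where only
one of them honours the QOS override:

```python
from dataclasses import dataclass


@dataclass
class Partition:
    min_nodes: int  # partition-level lower limit on nodes per job


@dataclass
class Job:
    nodes: int
    # Hypothetical stand-in for a QOS that is allowed to override
    # the partition's minimum-node limit.
    qos_overrides_min_nodes: bool


def backfill_check(job: Job, part: Partition) -> bool:
    """Honours the QOS override (the path that behaved correctly)."""
    return job.qos_overrides_min_nodes or job.nodes >= part.min_nodes


def scheduler_check_buggy(job: Job, part: Partition) -> bool:
    """Duplicate of the check that forgets the QOS override."""
    return job.nodes >= part.min_nodes


part = Partition(min_nodes=4)
job = Job(nodes=1, qos_overrides_min_nodes=True)

print(backfill_check(job, part))         # True: the backfiller starts the job
print(scheduler_check_buggy(job, part))  # False: the scheduler never does
```

With a divergence like this, such jobs can only ever start through the
backfill path, which would show up as unusually heavy backfill use.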


(This was a while ago, and I do remember someone at SchedMD mentioning
that they were going to "de-duplicate" the scheduler and backfiller code
in the future, but I don't know how far they've gotten with it.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
