It might be unrelated, but I remember we had some similar problems when setting up a new cluster two years ago. I don't remember the details, but I believe it was related to QOSes overriding partition limits. Jobs in these QOSes (with requests that violated a partition limit, such as the minimum number of nodes) were started fine by the backfiller, but not by the main scheduler. It turned out that the checks for this were correct in the backfiller, but had a bug in the scheduler. The scheduler bug was fixed for that particular case, but you might be hitting something similar.
(This was a while ago, and I do remember someone at SchedMD mentioning that they were going to "de-duplicate" the scheduler and backfiller code in the future, but I don't know how far they've gotten with it.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo