Yes, we do have time limits set on the partitions: 7 days maximum, 3 days default. In this case the larger job is requesting 3 days of walltime and the smaller jobs are requesting 7.
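For reference, the partition is configured roughly along these lines (a sketch, not a copy-paste of our slurm.conf; "largenode" is the partition name from the log below, but the node list is just a placeholder):

    PartitionName=largenode Nodes=cloud[001-100] DefaultTime=3-00:00:00 MaxTime=7-00:00:00 State=UP

and the walltimes come from the job submissions, e.g.

    sbatch --time=3-00:00:00 --cpus-per-task=16 big-job.sh
    sbatch --time=7-00:00:00 --cpus-per-task=6 small-job.sh

(script names and core counts here are only illustrative).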
Thanks

M

On Wed, Feb 27, 2019 at 12:41 PM Andy Riebs <andy.ri...@hpe.com> wrote:

> Michael, are you setting time limits for the jobs? That's a huge part of a
> scheduler's decision about whether another job can be run. For example, if
> a job is running with the Slurm default of "infinite," the scheduler will
> likely decide that jobs that will fit in the remaining nodes will be able
> to finish before the job that requires infinite time.
>
> Andy
>
> ------------------------------
> *From:* Michael Gutteridge <michael.gutteri...@gmail.com>
> *Sent:* Wednesday, February 27, 2019 3:29 PM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Cc:*
> *Subject:* [slurm-users] Large job starvation on cloud cluster
>
> I've run into a problem with a cluster we've got in a cloud provider -
> hoping someone might have some advice.
>
> The problem is that I've got a circumstance where large jobs _never_
> start... or more correctly, that large-er jobs don't start when there are
> many smaller jobs in the partition. In this cluster, accounts are limited
> to 300 cores. One user has submitted a couple thousand jobs that each use
> 6 cores. These queue up, start nodes, and eventually all 300 cores in the
> limit are busy and the remaining jobs are held with "AssocGrpCpuLimit".
> All as expected.
>
> Then another user submits a job requesting 16 cores. This one, too, gets
> held with the same reason. However, that larger job never starts even if
> it has the highest priority of jobs in this account (I've set it manually
> and by using nice).
>
> What I see in the sched.log is:
>
> sched: [2019-02-25T16:00:14.940] Running job scheduler
> sched: [2019-02-25T16:00:14.941] JobId=2210784 delayed for accounting policy
> sched: [2019-02-25T16:00:14.942] JobId=2203130 initiated
> sched: [2019-02-25T16:00:14.942] Allocate JobId=2203130 NodeList=node1 #CPUs=6 Partition=largenode
>
> In this case, 2210784 is the job requesting 16 cores and 2203130 is one of
> the six-core jobs. This seems to happen with either the backfill or
> builtin scheduler. I suspect what's happening is that when one of the
> smaller jobs completes, the scheduler first looks at the higher-priority
> large job, determines that it cannot run because of the constraint, looks
> at the next job in the list, determines that it can run without exceeding
> the limit, and then starts that job. In this way, the larger job isn't
> started until all of these smaller jobs complete.
>
> I thought that switching to the builtin scheduler would fix this, but as
> slurm.conf(5) indicates:
>
> > An exception is made for jobs that can not run due
> > to partition constraints (e.g. the time limit) or
> > down/drained nodes. In that case, lower priority
> > jobs can be initiated and not impact the higher
> > priority job.
>
> I suspect one of these exceptions is being triggered - the limit is in the
> job's association, so I don't think it's a partition constraint. I don't
> have this problem with the on-premises cluster, so I suspect it's something
> to do with power management and the state of powered-down nodes.
>
> I've sort-of worked around this by setting a per-user limit lower than the
> per-account limit, but that doesn't address any situation where a single
> user submits large and small jobs, and it does lead to some other problems
> in other groups, so it's not a long-term solution.
>
> Thanks for having a look
>
> - Michael
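P.S. For anyone finding this thread in the archives: the per-user workaround I mention in my original message above was done with association limits via sacctmgr, roughly like this (account/user names and the 150-core figure are placeholders, not our real values):

    # account-wide cap (what we already had)
    sacctmgr modify account where name=<account> set GrpTRES=cpu=300
    # lower per-user cap within that account
    sacctmgr modify user where name=<user> account=<account> set GrpTRES=cpu=150

I believe a QOS with MaxTRESPerUser set would accomplish much the same thing, but I haven't tried that here.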