Re: [slurm-users] Large job starvation on cloud cluster

Andy Riebs Wed, 27 Feb 2019 12:41:36 -0800

Michael, are you setting time limits for the jobs? That's a huge part of a scheduler's decision about whether another job can be run. For example, if a job is running with the Slurm default of "infinite," the scheduler will likely decide that jobs that will fit in the remaining nodes will be able to finish before the job that requires infinite time.
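
To illustrate why time limits matter here, the following is a minimal sketch of a backfill-style check. It is a toy model, not Slurm's implementation: the function names are hypothetical, times are in minutes, and it assumes the blocked job needs every node currently in use, so it can only start once the longest remaining time limit expires.

```python
import math

INFINITE = math.inf  # stands in for an unlimited ("infinite") time limit


def blocked_job_start(now, remaining_limits):
    """Earliest start of the blocked job under the toy assumption that it
    must wait for every running job to reach its time limit."""
    return now + max(remaining_limits)


def can_backfill(now, candidate_limit, reserved_start):
    """A lower-priority candidate may run ahead only if its own time limit
    guarantees it finishes by the blocked job's expected start."""
    return now + candidate_limit <= reserved_start


# One running job has no time limit, so the blocked job's start is "never"
# and any candidate that fits appears harmless to run first.
never = blocked_job_start(0, [120, INFINITE])
print(can_backfill(0, 600, never))

# With finite limits the blocked job has a real expected start time,
# so a long candidate is refused and only a short one may go ahead.
start = blocked_job_start(0, [120, 60])
print(can_backfill(0, 600, start), can_backfill(0, 90, start))
```

In this model, a single running job without a time limit is enough to make every fitting candidate look safe to start ahead of the waiting job.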


Andy


------------------------------------------------------------------------
*From:* Michael Gutteridge <michael.gutteri...@gmail.com>
*Sent:* Wednesday, February 27, 2019 3:29PM
*To:* Slurm User Community List <slurm-users@lists.schedmd.com>
*Cc:*
*Subject:* [slurm-users] Large job starvation on cloud cluster

I've run into a problem with a cluster we've got in a cloud provider - hoping someone might have some advice.

The problem is that I've got a circumstance where large jobs _never_ start... or more correctly, that large-er jobs don't start when there are many smaller jobs in the partition. In this cluster, accounts are limited to 300 cores. One user has submitted a couple thousand jobs that each use 6 cores. These queue up, start nodes, and eventually all 300 cores in the limit are busy and the remaining jobs are held with "AssocGrpCpuLimit". All as expected.

Then another user submits a job requesting 16 cores. This one, too, gets held with the same reason. However, that larger job never starts even if it has the highest priority of jobs in this account (I've set it manually and by using nice).


What I see in the sched.log is:

sched: [2019-02-25T16:00:14.940] Running job scheduler
sched: [2019-02-25T16:00:14.941] JobId=2210784 delayed for accounting policy
sched: [2019-02-25T16:00:14.942] JobId=2203130 initiated
sched: [2019-02-25T16:00:14.942] Allocate JobId=2203130 NodeList=node1 #CPUs=6 Partition=largenode

In this case, 2210784 is the job requesting 16 cores and 2203130 is one of the six core jobs. This seems to happen with either the backfill or builtin scheduler. I suspect what's happening is that when one of the smaller jobs completes, the scheduler first looks at the higher-priority large job, determines that it cannot run because of the constraint, looks at the next job in the list, determines that it can run without exceeding the limit, and then starts that job. In this way, the larger job isn't started until all of these smaller jobs complete.
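
The suspected behavior can be reproduced in a toy simulation. This is only a sketch of the hypothesis above, not Slurm's scheduling code: the `Job` class, `schedule_pass`, and `simulate` are hypothetical names, the 300-core limit and the 6- and 16-core sizes come from the description, and the priorities and the count of 100 small jobs are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    cores: int
    priority: int


def schedule_pass(pending, cores_in_use, limit):
    """Walk pending jobs in priority order and start every job that fits
    under the account limit. A job that does not fit is skipped and no
    cores are held back for it."""
    started, still_pending = [], []
    for job in sorted(pending, key=lambda j: -j.priority):
        if cores_in_use + job.cores <= limit:
            cores_in_use += job.cores
            started.append(job)
        else:
            still_pending.append(job)  # "delayed for accounting policy"
    return started, still_pending, cores_in_use


def simulate(n_small=100, limit=300):
    """Return (completions, small_started_ahead) observed before the
    high-priority 16-core job finally starts."""
    large = Job(job_id=0, cores=16, priority=1000)
    small = [Job(job_id=i, cores=6, priority=1) for i in range(1, n_small + 1)]

    # The small jobs arrive first and fill the account limit.
    running, pending, in_use = schedule_pass(small, 0, limit)
    pending.append(large)

    completions = 0
    small_started_ahead = 0
    while large in pending:
        finished = running.pop(0)  # one running small job completes
        in_use -= finished.cores
        started, pending, in_use = schedule_pass(pending, in_use, limit)
        running.extend(started)
        completions += 1
        small_started_ahead += sum(1 for j in started if j is not large)
    return completions, small_started_ahead


print(simulate())
```

With these numbers, each completion frees 6 cores, the 16-core job is skipped, and the next 6-core job takes the freed cores. All 50 queued small jobs start ahead of the large job, which gets in only on the 53rd completion, once the small queue is empty and three running jobs have drained.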

I thought that switching to the builtin scheduler would fix this, but as slurm.conf(5) indicates:


> An exception is made for jobs that can not run due
> to partition constraints (e.g. the time limit) or
> down/drained nodes.  In that case, lower priority
> jobs can be initiated and not impact the higher
> priority job.

I suspect one of these exceptions is being triggered - the limit is in the job's association, so I don't think it's a partition constraint. I don't have this problem with the on-premises cluster, so I suspect it's something to do with power management and the state of powered-down nodes.

I've sort-of worked around this by setting a per-user limit lower than the per-account limit, but that doesn't address any situation where a single user submits large and small jobs and does lead to some other problems in other groups, so it's not a long-term solution.


Thanks for having a look

 - Michael
