Chris,

I'm dealing with this problem myself right now. We use Slurm here. We really have one large, very heterogeneous cluster that's treated as multiple smaller clusters by creating multiple partitions, each with its own QOS. We also have some users who don't understand the difference between -n and -N when specifying a job size, which has led to jobs specified with -N sitting in the queue for an unusually long time. Yes, part of the solution is definitely user education, but there are still times when a user genuinely should request nodes and not tasks (using OpenMP within a node, etc.).
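As a first step, I'm thinking of a submit-time nudge along these lines. This is only a sketch against the job_submit.lua interface, not something we're running yet; the exact field names and NO_VAL handling should be checked against your Slurm release.

-- job_submit.lua sketch: warn users who ask for nodes (-N) but no tasks (-n).
-- Assumes unset numeric fields arrive as nil or slurm.NO_VAL; check your Slurm release.
local function is_set(v)
   return v ~= nil and v ~= slurm.NO_VAL
end

function slurm_job_submit(job_desc, part_list, submit_uid)
   if is_set(job_desc.min_nodes) and not is_set(job_desc.num_tasks) then
      -- Let the job through, but tell the user what they actually asked for.
      slurm.log_user(string.format(
         "NOTE: you requested %d whole node(s) (-N) with no task count (-n); " ..
         "if you meant tasks, use -n.", job_desc.min_nodes))
   end
   return slurm.SUCCESS
end

-- The plugin also expects slurm_job_modify() to be defined.
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end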

Here's how I'm going to tackle this problem: most of our nodes are 32-core, but some older 16-core nodes are still in use, so we're going to make sure that jobs going to our larger partitions request a multiple of 16 tasks. That way, a job will either occupy whole nodes or leave half a node available.

We have one partition meant for single-node or smaller jobs. That partition has only Ethernet, since it shouldn't be supporting inter-node jobs. On that partition, jobs can use 16 cores or fewer.

To make this work, I will be using job_submit.lua to apply this logic and assign each job to a partition. If a user requests a specific partition that isn't in line with these specifications, job_submit.lua will reassign the job to the appropriate QOS.
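For the curious, the routing logic will look roughly like this. It's only a sketch: the partition names ("single" and "big") are placeholders for ours, the 16/32 numbers match our hardware, and the job_desc field names and NO_VAL constants should be double-checked against your Slurm version.

-- job_submit.lua sketch of the routing described above.
-- "single" and "big" are placeholder names for our Ethernet-only small-job
-- partition and the main 32-core/16-core partition respectively.
local SMALL_PART = "single"
local BIG_PART   = "big"
local CHUNK      = 16          -- smallest node size still in service

-- Unset numeric fields may show up as nil or as a NO_VAL sentinel.
local function is_set(v)
   return v ~= nil and v ~= slurm.NO_VAL and v ~= slurm.NO_VAL16
end

function slurm_job_submit(job_desc, part_list, submit_uid)
   local tasks = is_set(job_desc.num_tasks) and job_desc.num_tasks or 1
   local cpt   = is_set(job_desc.cpus_per_task) and job_desc.cpus_per_task or 1
   local cores = tasks * cpt

   if cores <= CHUNK and (not is_set(job_desc.min_nodes) or job_desc.min_nodes <= 1) then
      -- Small job: steer it to the Ethernet-only partition.
      if job_desc.partition ~= SMALL_PART then
         slurm.log_user(string.format("Routing %d-core job to partition %s",
                                      cores, SMALL_PART))
         job_desc.partition = SMALL_PART
      end
   else
      -- Bigger job: insist on a multiple of 16 tasks so it fills whole or half nodes.
      if tasks % CHUNK ~= 0 then
         slurm.log_user(string.format(
            "Jobs in %s must request a multiple of %d tasks (-n); you asked for %d",
            BIG_PART, CHUNK, tasks))
         return slurm.ERROR
      end
      job_desc.partition = BIG_PART
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end

The QOS reassignment would hang off the same branches (setting job_desc.qos), but I've left that out of the sketch until we've settled the policy.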

I'll be happy to share how this works after it's been in place for a few months.


On 06/08/2018 03:21 AM, Chris Samuel wrote:
Hi all,

I'm curious to know what/how/where/if sites do to try and reduce the impact of
fragmentation of resources by small/narrow jobs on systems where you also have
to cope with large/wide parallel jobs?

For my purposes a small/narrow job is anything that will fit on one node
(whether a single core job, multi-threaded or MPI).

One thing we're considering is to use overlapping partitions in Slurm to have
a subset of nodes that are available to these types of jobs and then have
large parallel jobs use a partition that can access any node.

This has the added benefit of letting us set a higher priority on that
partition to let Slurm try and place those jobs first, before smaller ones.

We're already using a similar scheme for GPU jobs where they get put into a
partition that can access all 36 cores on a node whereas non-GPU jobs get put
into a partition that can only access 32 cores on a node, so effectively we
reserve 4 cores a node for GPU jobs.

But really I'm curious to know what people do about this, or do you not worry
about it at all and just let the scheduler do its best?

All the best,
Chris
