Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

Bill Abbott Fri, 08 Jun 2018 07:39:57 -0700

We set PriorityFavorSmall=NO and PriorityWeightJobSize to some appropriately large value in slurm.conf, which helps.
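
A rough way to see why this helps is to look at how a job-size term scales with the size of the request. The sketch below is a deliberately simplified model, not Slurm's actual calculation: it assumes the factor is just the fraction of the cluster's CPUs requested, and the function name, cluster size and weight are all made up for illustration.

```python
def job_size_priority(requested_cpus, total_cpus, weight, favor_small=False):
    """Simplified job-size contribution to a multifactor priority.

    Assumption: the size factor is the fraction of the cluster requested
    (or its complement when small jobs are favoured). The real scheduler
    computes this differently in detail.
    """
    if not 0 < requested_cpus <= total_cpus:
        raise ValueError("requested_cpus must be between 1 and total_cpus")
    fraction = requested_cpus / total_cpus
    factor = (1.0 - fraction) if favor_small else fraction
    return round(weight * factor)


# Hypothetical 1000-core cluster with a job-size weight of 100000.
wide = job_size_priority(800, 1000, weight=100000)
narrow = job_size_priority(1, 1000, weight=100000)
print(wide, narrow)
```

With small jobs not favoured, the wide job gets a far larger size term than the single-core job, so a large weight lets it climb the queue ahead of the small jobs that would otherwise fragment the nodes.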

We also used to limit the number of total jobs a single user could run to something like 30% of the cluster, so a user could run a single MPI job that takes all nodes, but couldn't run single-core jobs that take all nodes. We switched away from that to an owner/preemption system. Now if a user pays for access they can run whatever they want on their allocation, and if they don't pay we don't have to care what happens to them. Sort of.

One idea we're working towards is to have a VM cluster in one of the commercial cloud providers that only accepts small jobs, and use Slurm federation to steer the smaller jobs there, leaving the on-prem nodes for big MPI jobs. We're not there yet, but it shouldn't be a problem to implement technically.


Bill

On 06/08/2018 10:16 AM, Paul Edmon wrote:

Yeah, this one is tricky. In general we take the wild west approach here, but I've had users use --contiguous and their job takes forever to run.
I suppose one method would be to enforce that each job take a full node and that parallel jobs always use --contiguous. As I recall, Slurm will preferentially fill up nodes to try to leave as large a contiguous block as it can.
The other option would be to use requeue to your advantage. Namely, just have a high-priority queue only for large contiguous jobs, and it just requeues all the jobs it needs to in order to run. That would depend on your single node/core users' tolerance for being requeued.
-Paul Edmon-


On 06/08/2018 03:55 AM, John Hearns via Beowulf wrote:
Chris, good question. I can't give a direct answer there, but let me share my experiences.
In the past I managed SGI ICE clusters and a large memory UV system with PBSPro queuing. The engineers submitted CFD solver jobs using scripts, and we only allowed them to use a multiple of N CPUs; in fact there were queues named after, let's say, 2N or 4N CPU cores. The number of cores was cunningly arranged to fit into what SGI term an IRU, or what everyone else would call a blade chassis.
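
To make the "multiples of N" idea concrete, here is a minimal sketch of mapping a requested core count onto the smallest allowed queue size. The function name, the value of N and the set of allowed multiples are hypothetical; the message only mentions queues of, say, 2N or 4N cores.

```python
def queue_size_for(requested_cores, n, multiples=(1, 2, 4)):
    """Return the smallest allowed job size (a multiple of n) that fits.

    n stands for the cores in one chassis-aligned unit; both n and the
    allowed multiples are illustrative assumptions, not site values.
    """
    if requested_cores <= 0 or n <= 0:
        raise ValueError("requested_cores and n must be positive")
    for multiple in sorted(multiples):
        if requested_cores <= multiple * n:
            return multiple * n
    raise ValueError("request exceeds the largest queue size")


# Hypothetical unit of 128 cores.
print(queue_size_for(200, 128))  # lands in the 2N queue
print(queue_size_for(100, 128))  # lands in the 1N queue
```

Because every job size is a whole number of units, jobs tile the machine cleanly and the number of jobs that fit is easy to predict.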
We had job exclusivity, and engineers were not allowed to choose how many CPUs they used. This is a very efficient way to run HPC, as you have a clear view of how many jobs fit on a cluster.
Yes, before you say it, this does not cater for the mixed workload with lots of single CPU jobs, Matlab, Python etc.
When the UV arrived I configured bladesets (placement sets) such that the scheduler tried to allocate CPUs and memory from blades adjacent to each other. Again, much better efficiency. If I'm not wrong, you do that in Slurm by defining switches.
When the high core count AMDs came along, I again configured blade sets, and the number of CPUs per job was increased to cope with larger core count CPUs, but again cunningly arranged to equal the number of cores in a placement set (placement sets were configured to be half, full or two IRUs).
At another place of employment recently we had a hugely mixed workload, ranging from interactive graphics, to the Matlab type jobs, to multinode CFD jobs. In addition to that we had different CPU generations and GPUs in the mix. That setup was a lot harder to manage and keep up the efficiency of use, as you can imagine.
I agree with you about the overlapping partitions. If I were to arrange things in my ideal world, I would have a set of the latest generation CPUs using the latest generation interconnect and reserve them for 'job exclusive' jobs, i.e. parallel jobs, and leave other nodes exclusively for one node or one core jobs.
Then have some mechanism to grow/shrink the partitions.
One thing again which I found difficult in my last job was users 'hard wiring' the number of CPUs they use. In fact I have seen that quite often on other projects. What happens is that a new PhD or postdoc or new engineer is gifted a job submission script from someone who is leaving, or moving on. The new person doesn't really understand why (say) six nodes with eight CPU cores are requested. But (a) they just want to get on and do the job (b) they are scared of breaking things by altering the script. So the number of CPUs doesn't change, and with the latest generation 20-plus cores on a node you get wasted cores. Also, having mixed generations of CPUs with different core counts does not help here.
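
The cost of such an inherited script is easy to put numbers on. The sketch below assumes nodes are allocated exclusively and that each node has exactly 20 cores (the message says "20 plus"); the function names are made up for illustration.

```python
import math


def exclusive_usage(nodes_requested, cores_used_per_node, cores_per_node):
    """Return (allocated, used, idle) cores for an exclusive-node job."""
    if cores_used_per_node > cores_per_node:
        raise ValueError("cannot use more cores than a node has")
    allocated = nodes_requested * cores_per_node
    used = nodes_requested * cores_used_per_node
    return allocated, used, allocated - used


def nodes_needed(total_ranks, cores_per_node):
    """Fewest nodes that hold total_ranks at one rank per core."""
    return math.ceil(total_ranks / cores_per_node)


# Inherited script: six nodes, eight cores each, now on 20-core nodes.
allocated, used, idle = exclusive_usage(6, 8, 20)
print(allocated, used, idle)

# Same 48 ranks repacked to fill the nodes.
repacked_nodes = nodes_needed(used, 20)
print(repacked_nodes, repacked_nodes * 20 - used)
```

Under these assumptions the old script ties up 120 cores to run 48 ranks, while repacking the same 48 ranks needs three nodes and leaves 12 cores idle instead of 72.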
Yes, I know we as HPC admins can easily adjust job scripts to mpirun with N equal to the number of cores on a node (etc). In fact, when I have worked with users and showed them how to do this, it has been a source of satisfaction to me.
On 8 June 2018 at 09:21, Chris Samuel <ch...@csamuel.org> wrote:
    Hi all,

    I'm curious to know what/how/where/if sites do to try and reduce the
    impact of fragmentation of resources by small/narrow jobs on systems
    where you also have to cope with large/wide parallel jobs?

    For my purposes a small/narrow job is anything that will fit on one
    node (whether a single core job, multi-threaded or MPI).

    One thing we're considering is to use overlapping partitions in Slurm
    to have a subset of nodes that are available to these types of jobs
    and then have large parallel jobs use a partition that can access any
    node.

    This has the added benefit of letting us set a higher priority on that
    partition to let Slurm try and place those jobs first, before smaller
    ones.
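
A toy model of the overlapping-partition idea, for illustration only: the node names, partition names and priority values are hypothetical, and the routing rule (one node means narrow) follows the definition of a small job given above.

```python
# Hypothetical 8-node cluster.
ALL_NODES = [f"node{i:02d}" for i in range(1, 9)]

# "narrow" is a subset of "wide", so the two partitions overlap.
PARTITIONS = {
    "wide": {"nodes": set(ALL_NODES), "priority": 10},
    "narrow": {"nodes": set(ALL_NODES[:2]), "priority": 1},
}


def partition_for(nodes_requested):
    """Route single-node jobs to the narrow partition, the rest to wide."""
    return "narrow" if nodes_requested == 1 else "wide"


def placement_order(jobs):
    """Order (name, nodes_requested) jobs by partition priority, highest first."""
    return sorted(
        jobs, key=lambda job: -PARTITIONS[partition_for(job[1])]["priority"]
    )


queue = [("serial-a", 1), ("cfd-big", 6), ("serial-b", 1)]
ordered = [name for name, _ in placement_order(queue)]
print(ordered)
```

The wide job is considered first even though it was submitted between two serial jobs, and the serial jobs can only ever land on the two nodes in the narrow subset.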

    We're already using a similar scheme for GPU jobs where they get put
    into a partition that can access all 36 cores on a node whereas
    non-GPU jobs get put into a partition that can only access 32 cores on
    a node, so effectively we reserve 4 cores a node for GPU jobs.
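
The per-node arithmetic of that scheme can be sketched as an admission check. This is an illustration of the accounting only, not of how the scheduler implements it; the function name is hypothetical. Slurm's partition-level MaxCPUsPerNode option can express a cap like this, though the message does not say which mechanism is used.

```python
CORES_PER_NODE = 36   # from the message
NON_GPU_CAP = 32      # from the message


def can_start(job_cores, is_gpu_job, cores_in_use, non_gpu_cores_in_use):
    """Decide whether a job fits on a node under the two-partition cap.

    cores_in_use counts every busy core on the node; non_gpu_cores_in_use
    counts only those held by non-GPU jobs.
    """
    if cores_in_use + job_cores > CORES_PER_NODE:
        return False
    if not is_gpu_job and non_gpu_cores_in_use + job_cores > NON_GPU_CAP:
        return False
    return True


reserved_for_gpu = CORES_PER_NODE - NON_GPU_CAP
print(reserved_for_gpu)
print(can_start(4, False, 32, 32))  # non-GPU job hits the 32-core cap
print(can_start(4, True, 32, 32))   # GPU job may use the last 4 cores
```

Once non-GPU jobs hold 32 cores on a node, only a GPU job can claim the remaining four.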

    But really I'm curious to know what people do about this, or do you
    not worry about it at all and just let the scheduler do its best?

    All the best,
    Chris
-- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
    _______________________________________________
    Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
    To change your subscription (digest mode or unsubscribe) visit
    http://www.beowulf.org/mailman/listinfo/beowulf
    
