In the Grid Engine world, we've worked around some of the resource fragmentation issues by assigning static sequence numbers to queue instances (a node publishing resources to a queue) and then having the scheduler fill nodes by sequence number rather than spreading jobs across the cluster. This leaves some nodes free of jobs unless a really big job comes in that requires entire nodes.
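For anyone wanting to reproduce this, the two knobs involved are roughly the following (a sketch from memory; the exact attribute names are in `sched_conf(5)` and `queue_conf(5)` for your Grid Engine version, and the host names here are hypothetical):

```shell
# Tell the scheduler to sort queue instances by seq_no instead of by load
# (qconf -msconf opens the scheduler configuration in an editor):
qconf -msconf
#   queue_sort_method   seqno

# Give each queue instance a static sequence number so the scheduler
# fills low-numbered hosts first (per-host overrides in brackets):
qconf -mq all.q
#   seq_no   0,[node001=1],[node002=2],[node003=3]
```

With load-based sorting the scheduler spreads jobs to the least-loaded hosts; with `seqno` it packs them, which is what keeps the high-numbered nodes empty for the big jobs.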
Since we're a bioinformatics shop, most of our jobs aren't parallel, though a few job types require lots of memory (we have a handful of nodes in the 1 TB-4 TB RAM range). Grid Engine lets us isolate jobs from each other using cgroups, where a job's resource request is translated directly into the resource (memory, CPU, etc.) limits of its cgroup.

On Fri, Jan 17, 2020 at 08:44:14AM +0000, Tim Cutts wrote:
> Indeed, and you can quite easily get into a “boulders and sand”
> scheduling problem; if you allow the small interactive jobs (the sand)
> free access to everything, the scheduler tends to find them easy to
> schedule, partially fills nodes with them, and then finds it can’t find
> contiguous resources large enough for the big parallel jobs (the
> boulders), and you end up with the large batch jobs pending forever.
>
> I’ve tried various approaches to this in the past; for example,
> pre-emption of large long-running jobs, but that causes resource
> starvation (suspended jobs are still consuming virtual memory) and then
> all sorts of issues with timeouts on TCP connections and so forth,
> these being genomics jobs with lots of not-normal-HPC activities like
> talking to relational databases.
>
> I think you always end up having to ring-fence hardware for the large
> parallel batch jobs, and not allow the interactive stuff on it.
>
> This of course is what leads some users to favour the cloud, because it
> appears to be infinite, and so the problem appears to go away. But
> let's not get into that argument here.
>
> Tim
>
> On 16 Jan 2020, at 23:50, Alex Chekholko via Beowulf
> <beowulf@beowulf.org> wrote:
>
> Hey Jim,
>
> There is an inverse relationship between latency and throughput. Most
> supercomputing centers aim to keep their overall utilization high, so
> the queue always needs to be full of jobs. If you can have 1000 nodes
> always idle and available, then your 1000-node jobs will usually take
> 10 seconds.
> But your overall utilization will be in the low single-digit percent
> or worse.
>
> Regards,
> Alex
>
> On Thu, Jan 16, 2020 at 3:25 PM Lux, Jim (US 337K) via Beowulf
> <beowulf@beowulf.org> wrote:
>
> Are there any references out there that discuss the tradeoffs between
> interactive and batch scheduling (perhaps some from the 60s and 70s)?
>
> Most big HPC systems have a mix of giant jobs and smaller ones managed
> by some process like PBS or SLURM, with queues of various-sized jobs.
>
> What I’m interested in is the idea of jobs that, if spread across many
> nodes (dozens), can complete in seconds (<1 minute), providing
> essentially “interactive” access, in the context of large jobs taking
> days to complete. It’s not clear to me that the current schedulers can
> actually do this; rather, they allocate M of N nodes to a particular
> job pulled out of a series of queues, and that job “owns” the nodes
> until it completes. Smaller jobs get run on (M-1) of the N nodes, and
> presumably complete faster, so it works down through the queue quicker,
> but ultimately, if you have a job that would take, say, 10 seconds on
> 1000 nodes, it’s going to take 20 minutes on 10 nodes.
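Jim's back-of-the-envelope figure falls straight out of node-seconds accounting (assuming perfect scaling, which real jobs won't achieve, so the true number is worse):

```shell
# Total work implied by "10 seconds on 1000 nodes":
work=$((10 * 1000))             # 10000 node-seconds
# Wall time if only 10 nodes are available:
echo "$((work / 10)) seconds"   # prints "1000 seconds", i.e. ~17 minutes
```

So the quoted "20 minutes" is the right order of magnitude once you add queue wait and startup overhead on top of the ideal 1000 seconds.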
> Jim
>
> --
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
> --
> The Wellcome Sanger Institute is operated by Genome Research Limited,
> a charity registered in England with number 1021457 and a company
> registered in England with number 2742969, whose registered office is
> 215 Euston Road, London, NW1 2BE.
--
Skylar

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf