Re: [slurm-users] Tuning MaxJobs and MaxJobsSubmit per user and for the whole cluster?

Paul Edmon Mon, 10 Aug 2020 16:50:22 -0700

Yeah, I imagine that this also varies depending on the average length of a job. In our case we do about a job per core per day, so double the number of cores works out well for us. However, if you have higher turnover, permitting a higher number of jobs is wiser.
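To see why job length matters, here is a rough Python sketch. It is an illustration under assumptions that are ours, not Paul's: every core runs one job, jobs are roughly uniform in length, and everything above one job per core is pending. Under those assumptions it estimates how many days of queued work a given cap holds at a given turnover. The function name `backlog_days` is hypothetical.

```python
def backlog_days(cores: int, max_jobs: int, jobs_per_core_per_day: float) -> float:
    """Estimate how long the pending part of the queue lasts, in days.

    Hypothetical model: one running job per core, every job beyond that
    is pending, and each core completes jobs_per_core_per_day jobs.
    """
    if cores <= 0 or jobs_per_core_per_day <= 0:
        raise ValueError("cores and jobs_per_core_per_day must be positive")
    pending = max(max_jobs - cores, 0)          # jobs waiting beyond one per core
    throughput = cores * jobs_per_core_per_day  # jobs completed per day
    return pending / throughput


if __name__ == "__main__":
    cores = 10_000  # hypothetical cluster size
    # About one job per core per day with a cap of 2x cores: about 1 day of backlog.
    print(backlog_days(cores, 2 * cores, 1.0))
    # Ten times the turnover drains the same backlog in about 0.1 day,
    # so a short-job workload would want a higher cap to stay full.
    print(backlog_days(cores, 2 * cores, 10.0))
```

The point of the model is only the ratio: the same cap buys proportionally less queued work as jobs get shorter.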

We have a ton of partitions (130 as of last count), so our tuning has been a bit more complicated. However, the latest version of Slurm (20.02) vastly improved backfill efficiency, which has helped with making sure the cluster stays full. Nonetheless, we still seem to average a job per core per day here.



-Paul Edmon-


On 8/10/2020 4:43 PM, Sebastian T Smith wrote:

My rule of thumb for our cluster is 1,024 jobs/node. Our nodes have 32 cores, so we're at 32x core count (converting to Paul's units). We have 120 nodes, with a maximum of 122,880 jobs.
At a high level, nodes are allocated to different partitions, and each partition is allocated a maximum number of jobs equal to 1024 * num_nodes (reality isn't quite this simple). Our largest partition features 54,272 max jobs (53 nodes). I've seen this maxed out a number of times with a large number of very short jobs, and with job arrays.
This setup has required a bit of tuning. Adjusting sched_max_job_start and sched_min_interval has been sufficient to keep Slurm responsive when users are submitting or cancelling a large number of jobs. Backfill tuning has been difficult because we made some poor decisions setting DefaultTime when our system was first set up. Overall, performance has been excellent after minimal tuning.
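The arithmetic behind those figures, as a small Python sketch. The helper name `max_jobs_for` is hypothetical, and the sketch deliberately ignores whatever makes "reality not quite this simple" (overlapping partitions, exceptions); it only reproduces the flat jobs-per-node rule.

```python
JOBS_PER_NODE = 1024  # Sebastian's rule of thumb
CORES_PER_NODE = 32   # cores per node on his cluster


def max_jobs_for(num_nodes: int, jobs_per_node: int = JOBS_PER_NODE) -> int:
    """Job cap for a group of nodes under a flat jobs-per-node rule."""
    return jobs_per_node * num_nodes


# Per-core multiplier in Paul's units: 1024 / 32 = 32x core count.
core_multiplier = JOBS_PER_NODE // CORES_PER_NODE

cluster_cap = max_jobs_for(120)           # 120 nodes -> 122,880 jobs
largest_partition_cap = max_jobs_for(53)  # 53 nodes -> 54,272 jobs

print(core_multiplier, cluster_cap, largest_partition_cap)  # 32 122880 54272
```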
- Sebastian

--

University of Nevada, Reno <http://www.unr.edu/>
Sebastian Smith
High-Performance Computing Engineer
Office of Information Technology
1664 North Virginia Street
MS 0291

work-phone: 775-682-5050
email: stsm...@unr.edu
website: http://rc.unr.edu

------------------------------------------------------------------------
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Paul Edmon <ped...@cfa.harvard.edu>
Sent: Friday, August 7, 2020 6:22 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Tuning MaxJobs and MaxJobsSubmit per user and for the whole cluster?
My rule of thumb is that MaxJobs for the entire cluster should be twice the number of cores you have available. That way you have enough jobs running to fill all the cores and enough jobs pending to refill them.
As for per-user MaxJobs, it just depends on what you think is the maximum number any user can run without causing damage to themselves or the underlying filesystems, or interfering with other users. Practical experience has led us to set that limit to 10,000 on our cluster, but I imagine it will vary from location to location.
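Paul's rule of thumb, written out as a minimal Python sketch. The function names are hypothetical, and the running/pending split assumes one job per core, which is the reasoning given above rather than anything Slurm enforces. As an example size it uses Sebastian's cluster (120 nodes x 32 cores = 3,840 cores).

```python
def cluster_max_jobs(total_cores: int, multiplier: int = 2) -> int:
    """Cap cluster-wide jobs at multiplier x cores (Paul uses 2).

    With multiplier=2, one job per core can be running while an equal
    number waits in the queue to refill cores as jobs finish.
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return multiplier * total_cores


def split_running_pending(total_cores: int, max_jobs: int) -> tuple[int, int]:
    """Assuming one job per core, return (running, pending) at the cap."""
    running = min(total_cores, max_jobs)
    return running, max_jobs - running


cap = cluster_max_jobs(3840)  # 120 nodes x 32 cores
print(cap, split_running_pending(3840, cap))  # 7680 (3840, 3840)
```

Compared with the 122,880 figure above, this shows how far apart the two sites' rules land for the same hardware, which is consistent with Paul's point that job length drives the choice.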
-Paul Edmon-


On 8/6/2020 10:31 PM, Hoyle, Alan P wrote:
I can't find any advice online about how to tune things like MaxJobs on a per-cluster or per-user basis.
As far as I can tell, the default install's cluster MaxJobs seems to be 10,000, and MaxSubmit the same. Those seem pretty low to me: are there resources that get consumed if MaxSubmit is much higher, or can we raise this without much worry?
Is there advice anywhere about tuning these? When I google, all I can find are the generic "here's how to change this" and various universities' documentation of "here are the limits we have set."
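One way to check what a running cluster is actually set to is to read the controller's configuration. The Python sketch below assumes that `scontrol show config` prints one `Key = Value` pair per line and that the cluster-wide ceiling is the slurm.conf parameter MaxJobCount (documented default 10000). Per-user limits such as MaxJobs and MaxSubmitJobs are association settings managed with sacctmgr, so they will not appear in this output. The helper names are hypothetical.

```python
import subprocess


def parse_scontrol_config(text: str) -> dict[str, str]:
    """Parse `Key = Value` lines as printed by `scontrol show config`.

    Splits on the first '=' only, so values that themselves contain '='
    (e.g. SchedulerParameters) are kept intact. Lines without '=' are skipped.
    """
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            config[key.strip()] = value.strip()
    return config


def current_max_job_count() -> int:
    """Query the live controller; requires Slurm's scontrol on PATH."""
    out = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(parse_scontrol_config(out)["MaxJobCount"])
```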
-alan

--
Alan Hoyle - al...@unc.edu
Bioinformatics Scientist
UNC Lineberger - Bioinformatics Core
https://lbc.unc.edu/
