My rule of thumb for our cluster is 1,024 jobs per node.  Our nodes have 32 cores, 
so that works out to 32x core count (converting to Paul's units).  With 120 nodes, 
the cluster-wide maximum is 122,880 jobs.
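
For anyone who wants to enforce a cap like that, a minimal sketch is to raise 
MaxJobCount in slurm.conf above its 10,000 default (the numbers below are just 
ours, not a recommendation):

    # slurm.conf (illustrative values for a 120-node, 32-core/node cluster)
    MaxJobCount=122880     # total jobs allowed in the system at once (pending + running)
    MaxArraySize=10001     # maximum job array index + 1; worth checking if users rely on large arrays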

At a high level, nodes are allocated to different partitions, and each partition 
is allocated a maximum number of jobs equal to 1024 * num_nodes (reality isn't 
quite this simple).  Our largest partition allows 54,272 jobs (53 nodes).  I've 
seen that limit hit a number of times, both by floods of very short jobs and by 
job arrays.
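
As far as I know there isn't a direct per-partition MaxJobs parameter, so a 
limit like 1024 * num_nodes is typically expressed as a QOS attached to the 
partition.  A rough sketch, with hypothetical names and our 53-node partition's 
numbers (it also assumes AccountingStorageEnforce includes limits/qos):

    # slurm.conf: bind a QOS to the partition
    PartitionName=main Nodes=node[001-053] QOS=part_main State=UP

    # one-time accounting setup
    sacctmgr add qos part_main
    sacctmgr modify qos part_main set GrpSubmitJobs=54272   # 53 nodes * 1024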

This setup has required a bit of tuning.  Adjusting sched_max_job_start and 
sched_min_interval in SchedulerParameters has been sufficient to keep Slurm 
responsive when users submit or cancel large numbers of jobs.  Backfill tuning 
has been harder, because we made some poor DefaultTime decisions when the system 
was first deployed.  Overall, performance has been excellent with only minimal 
tuning.
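
For reference, those knobs live in SchedulerParameters, and DefaultTime is set 
per partition.  A sketch with placeholder values (not our production settings):

    # slurm.conf (placeholder values; tune for your own workload)
    SchedulerParameters=sched_max_job_start=300,sched_min_interval=2000000,bf_max_job_test=1000,bf_continue
    # sched_min_interval is in microseconds; sched_max_job_start caps jobs started per scheduler pass
    PartitionName=main Nodes=node[001-053] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP

The DefaultTime piece matters because backfill relies on job time limits: if the 
default sits close to MaxTime, every job without an explicit limit looks long to 
the backfill scheduler and little gets backfilled.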

- Sebastian

--

University of Nevada, Reno - http://www.unr.edu/
Sebastian Smith
High-Performance Computing Engineer
Office of Information Technology
1664 North Virginia Street
MS 0291

work-phone: 775-682-5050
email: stsm...@unr.edu
website: http://rc.unr.edu

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Paul 
Edmon <ped...@cfa.harvard.edu>
Sent: Friday, August 7, 2020 6:22 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Tuning MaxJobs and MaxJobsSubmit per user and for 
the whole cluster?


My rule of thumb is that the MaxJobs for the entire cluster is twice the number 
of cores you have available.  That way you have enough jobs running to fill all 
the cores and enough jobs pending to refill them.
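
To make that concrete with made-up numbers: a 64-node cluster with 32 cores per 
node would get roughly 2 x 64 x 32 = 4,096 as the cluster-wide cap.  Depending 
on how you enforce it, that maps onto something like MaxJobCount in slurm.conf 
or a group job limit on the root association.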


As for per-user MaxJobs, it just depends on the maximum number of jobs you think 
any one user can run without causing damage to themselves or the underlying 
filesystems, and without interfering with other users.  Practical experience has 
led us to set that limit to 10,000 on our cluster, but I imagine it will vary 
from location to location.
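
If you enforce that through the default QOS (here assumed to be the stock 
"normal" QOS, with accounting limits enforced), it might look something like:

    sacctmgr modify qos normal set MaxJobsPerUser=10000          # running jobs per user
    sacctmgr modify qos normal set MaxSubmitJobsPerUser=20000    # optional: also cap queued + running per user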


-Paul Edmon-


On 8/6/2020 10:31 PM, Hoyle, Alan P wrote:
I can't find any advice online about how to tune things like MaxJobs on a 
per-cluster or per-user basis.

As far as I can tell, the default cluster-wide MaxJobs on a fresh install is 
10,000, with MaxSubmit the same.  Those seem pretty low to me: are there 
resources that get consumed if MaxSubmit is much higher, or can we raise it 
without much worry?

Is there advice anywhere about tuning these?  When I google, all I can find are 
the generic "here's how to change this" and various universities' documentation 
of "here are the limits we have set."

-alan


--
Alan Hoyle - al...@unc.edu
Bioinformatics Scientist
UNC Lineberger - Bioinformatics Core
https://lbc.unc.edu/
