Hi,

On Wed, Jul 03, 2019 at 03:49:44PM +0000, David Baker wrote:
> Hello,
>
> A few of our users have asked about running longer jobs on our cluster.
> Currently our main/default compute partition has a time limit of 2.5 days.
> Potentially, a handful of users need jobs to run up to 5 days. Rather than
> allow all users/jobs to have a run time limit of 5 days I wondered if the
> following scheme makes sense...
>
We have a similar issue, where the default max walltime is 3 days, but because
checkpointing is not working properly at the moment, several high-end users
are asking for longer times.

> Increase the max run time on the default partition to be 5 days, however
> limit most users to a max of 2.5 days using the default "normal" QOS.
>
> Create a QOS called "long" with a max time limit of 5 days. Limit the users
> who can use "long". For authorized users assign "long" QOS to their jobs on
> basis of run time request.
>
> Does the above make sense or is it too complicated? If the above works could
> users limited to using the normal QOS have their running jobs run time
> increased to 5 days in exceptional circumstances?

Be aware that without restrictions, your users _will_ learn to take advantage
of the longer allowed walltime :)

We created a second partition that fully overlaps the default partition, but
with double the max wall time. Access to this partition is only granted upon
(motivated) request.

I have no idea if this makes less or more sense than your proposal; to me it's
a different way to accomplish pretty much the same goal :)

Kind regards,
--
Andy
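
For illustration only, the overlapping-partition setup described above could
look roughly like this in slurm.conf. This is a minimal sketch, not the actual
site configuration: the partition names, the node list and the "longjobs" group
are hypothetical, and restricting access through AllowGroups is just one
possible way to grant the partition on request.

```
# Hypothetical names and node range; adjust to the site.
# Default partition: 3-day limit, open to everyone.
PartitionName=batch Nodes=node[001-100] Default=YES MaxTime=3-00:00:00 State=UP
# Second partition on the same nodes, double the limit, restricted to one group.
PartitionName=long Nodes=node[001-100] Default=NO MaxTime=6-00:00:00 AllowGroups=longjobs State=UP
```

With a layout like this, a user who has been added to the group would submit
with "sbatch -p long", while everyone else stays on the default partition and
its 3-day limit.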