Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Greg Wickham
Hi Rob, Slurm doesn’t have a “validate” parameter, hence one must know ahead of time whether the configuration will work or not. In answer to your question – yes – on our site the Slurm configuration is altered outside of a maintenance window. Depending upon the potential impact of the change, …
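A minimal sketch of one way to catch parse errors ahead of time, assuming a scratch host where the candidate file can be fed to slurmctld in the foreground (host name and paths below are hypothetical); this only flags syntax-level problems, not runtime behaviour:

    # Copy the candidate config to a non-production host and let slurmctld
    # parse it in the foreground for a few seconds, watching for errors.
    scp slurm.conf.candidate testhost:/tmp/slurm.conf.candidate
    ssh testhost 'timeout 10 slurmctld -D -v -f /tmp/slurm.conf.candidate'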

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Brian Andrus
Run a secondary controller. Do 'scontrol takeover' before any changes, make your changes, and restart slurmctld on the primary. If it fails, no harm/no foul, because the secondary is still running happily. If it succeeds, it takes control back and you can then restart the secondary with the …
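A rough outline of that sequence, assuming systemd-managed daemons and two controller hosts called ctl1 (primary) and ctl2 (backup); the host names are placeholders:

    # On the backup controller (ctl2): take over scheduling.
    scontrol takeover

    # Push the edited slurm.conf to the primary and restart it there.
    ssh ctl1 'systemctl restart slurmctld'

    # Confirm which controller is in charge; if the primary came back cleanly
    # it resumes control, and the backup can then be restarted on the new config.
    scontrol ping
    ssh ctl2 'systemctl restart slurmctld'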

Re: [slurm-users] Accounting/access on total usage

2023-01-17 Thread Ross Dickson
Hi Frank. You can do something very close to what you describe with a QoS for each group and the NoDecay option ( https://slurm.schedmd.com/sacctmgr.html#SECTION_SPECIFICATIONS-FOR-QOS). We use this in conjunction with PriorityUsageResetPeriod=quarterly to provide a usage cap for certain groups.
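A sketch of how such a QoS might be set up, assuming a hypothetical account name (groupa) and an arbitrary cap of 100,000 CPU-minutes:

    # QoS whose recorded usage never decays, capped by GrpTRESMins.
    sacctmgr add qos groupa_cap Flags=NoDecay GrpTRESMins=cpu=100000

    # Attach the QoS to the group's account (account name is hypothetical).
    sacctmgr modify account groupa set QOS+=groupa_cap

    # slurm.conf: clear accumulated usage (and hence the cap) every quarter.
    #   PriorityUsageResetPeriod=QUARTERLY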

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Groner, Rob
So, you have two equal-sized clusters, one for test and one for production? Our test cluster is a small handful of machines compared to our production. We have a test Slurm control node on a test cluster with a test slurmdbd host and test nodes, all named specifically for test. We don't want a …
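For reference, one way to keep a small test setup fully separate is to give it its own cluster name, controller, and dbd host in its copy of slurm.conf; the names below are only illustrative:

    # slurm.conf on the test cluster (host names are hypothetical)
    ClusterName=testcluster
    SlurmctldHost=slurmctl-test
    AccountingStorageHost=slurmdbd-test

    # slurm.conf on production
    ClusterName=prod
    SlurmctldHost=slurmctl-prod
    AccountingStorageHost=slurmdbd-prod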

Re: [slurm-users] Job cancelled into the future

2023-01-17 Thread Reed Dier
So I was going to take a stab at trying to rectify this after taking care of post-holiday matters. Here is a paste of the $CLUSTER_job_table table where I think I see the issue, and now I just want to sanity-check my steps to remediate: https://rentry.co/qhw6mg (pastebin alternative) …
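Before touching anything, a read-only query along these lines can confirm which rows carry the runaway end time; the column names (id_job, state, time_start, time_end) are assumed from the slurmdbd MySQL schema and should be checked against the actual database, with slurmdbd stopped and a backup taken before any UPDATE:

    # Substitute the real cluster prefix for CLUSTER; credentials omitted.
    mysql slurm_acct_db <<'SQL'
    SELECT id_job, state, time_start, time_end
      FROM CLUSTER_job_table
     WHERE time_end > UNIX_TIMESTAMP(NOW());
    SQL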

Re: [slurm-users] Job preempts entire host instead of single job

2023-01-17 Thread Michael Gutteridge
Hi, I believe this is how the preemption algorithm works; it selects the entire node's resources: > For performance reasons, the backfill scheduler reserves whole nodes for jobs, not partial nodes. - https://slurm.schedmd.com/preempt.html#limitations However, that does specifically call out t…

[slurm-users] Job preempts entire host instead of single job

2023-01-17 Thread Michał Kadlof
Hi, I struggle with configuring job preemption. I have nodes with 8 Nvidia A100 GPUs. I have two partitions: short (lower priority) and sfglab (higher priority). I want to allow higher-priority jobs to preempt (REQUEUE mode) lower-priority jobs. It looks like it works, however it works too good …
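A minimal sketch of the partition-priority preemption setup being described; the node and partition lines are hypothetical and trimmed to the relevant options (a matching gres.conf and GresTypes=gpu are assumed):

    # slurm.conf: higher PriorityTier partitions preempt lower ones by requeue
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE

    NodeName=gpu[01-04] Gres=gpu:a100:8 CPUs=64 RealMemory=512000
    PartitionName=short  Nodes=gpu[01-04] PriorityTier=1  Default=YES
    PartitionName=sfglab Nodes=gpu[01-04] PriorityTier=10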