We are using a single-system "cluster" and want some control over fair use
of the GPUs. Users are not supposed to be able to use the GPUs until
they have allocated the resources through Slurm. We have no head node, so
slurmctld, slurmdbd, and slurmd all run on the same system.
I have a conf
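A common way to gate GPU access behind an allocation is to declare the GPUs
as GRES and let the cgroup task plugin constrain device access. A minimal
sketch, assuming a single 4-GPU node named gpu01 with NVIDIA device files
(node name, counts, and paths are illustrative):

    # slurm.conf
    GresTypes=gpu
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    NodeName=gpu01 Gres=gpu:4 CPUs=32 RealMemory=128000 State=UNKNOWN

    # gres.conf
    NodeName=gpu01 Name=gpu File=/dev/nvidia[0-3]

    # cgroup.conf
    ConstrainDevices=yes

With ConstrainDevices=yes a job only sees the GPU devices it was allocated,
so processes without an allocation get no GPU access.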
Dear Christopher,
I tried as you suggested and increased UnkillableStepTimeout from 60 to 120
seconds, but a few hours later three of my nodes were drained with reason
"Kill task failed" again. We're not using cgroups. There is a bug¹ on
SchedMD's tracker describing attempts to understand this err
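For anyone following along, the relevant setting and the usual way to return
a drained node to service are roughly as follows (the node name is
illustrative):

    # slurm.conf
    UnkillableStepTimeout=120

    # put a node drained with "Kill task failed" back into service
    scontrol update NodeName=node03 State=RESUME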
Hi Fabio,
My guess is that you can (partly) solve this by using the correct state
in slurm.conf. Either CLOUD or FUTURE might be what you're looking for.
See `man slurm.conf`.
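Something along these lines, where the node names and sizes are only
placeholders:

    # slurm.conf: declare every node the cluster might ever receive
    NodeName=node[01-10] CPUs=16 RealMemory=64000 State=FUTURE
    # or, if the nodes come and go dynamically:
    # NodeName=node[01-10] CPUs=16 RealMemory=64000 State=CLOUD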
Kind regards,
Martijn Kruiten
On Fri, 2019-05-17 at 09:17, Verzelloni Fabio wrote:
> Hello,
> I have a question r
Hello,
I have a question related to the cloud feature, or to any feature that could
solve an issue I have with my cluster. To make it simple, let's say that I
have a set of nodes (say 10 nodes). If needed, I move node(s) from cluster A
to cluster B, and in my slurm.conf I define all the possible num