We recently upgraded from Slurm 19.05.8 to 20.11.3. In our
configuration, we have an interruptible partition named 'interruptible'
for long-running, low-priority jobs that use checkpoint/restart. Jobs
that are preempted are killed and requeued rather than suspended.
This configuration has
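For context, a preempt-to-requeue setup like the one described is usually configured along these lines; this is a minimal sketch, and the node names and priority tiers are placeholders, not taken from this thread:

```
# Hypothetical slurm.conf excerpt (names are placeholders):
# jobs in 'interruptible' are preempted by requeue when a
# higher-PriorityTier partition needs the nodes.
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=interruptible Nodes=node[01-10] PriorityTier=1 Default=NO
PartitionName=normal        Nodes=node[01-10] PriorityTier=10 Default=YES
```

Note that requeue-on-preempt also requires the jobs themselves to be requeueable (JobRequeue=1 in slurm.conf, or sbatch --requeue).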
We saw something that sounds similar to this. See this bug report:
https://bugs.schedmd.com/show_bug.cgi?id=10196
SchedMD never found the root cause. They suspected a timing problem with
Prolog scripts, but what fixed it for us was setting GraceTime=0 on
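If it helps, GraceTime is a per-partition setting in slurm.conf; setting it to 0 means a preempted job is signalled and requeued immediately rather than being given a grace period first. A hedged sketch (partition and node names are placeholders):

```
# Hypothetical: GraceTime=0 on the preemptable partition, so preempted
# jobs get no grace period before being killed and requeued.
PartitionName=interruptible Nodes=node[01-10] PriorityTier=1 GraceTime=0
```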
Thank you! I’ll see if this is an option … would be nice.
I’ll see if we can try this.
Best wishes
Volker
> On Feb 25, 2021, at 11:07 PM, Angelos Ching
> wrote:
>
> I think it's related to the change in job step launch semantics
> introduced in 20.11.0, which has been reverted since 20.11.3, see
On 2/26/21 8:44 AM, Baldauf, Sebastian Martin wrote:
I just want to ask if someone has an idea how to give a GPU and some
CPUs of a node to one account exclusively but keep the remaining CPUs of
this node available for all users.
For me it looks like using partitions only works for whole
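One approach sometimes used for this is two overlapping partitions on the same node, with MaxCPUsPerNode capping each side so their CPU shares don't collide. This is only a sketch under assumptions (node name, account name, and core counts are all placeholders), and it caps CPUs only; it does not by itself fence off the GPU, which would still rely on accounting or QOS limits:

```
# Hypothetical slurm.conf sketch: one 32-core node with one GPU.
# 'priv' is restricted to one account; 'shared' is open to everyone
# but capped so 8 CPUs always remain for 'priv'.
NodeName=gpunode01 CPUs=32 Gres=gpu:1
PartitionName=priv   Nodes=gpunode01 AllowAccounts=acct_a MaxCPUsPerNode=8
PartitionName=shared Nodes=gpunode01 AllowAccounts=ALL    MaxCPUsPerNode=24
```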