> > been having the same issue with BCM, CentOS 8.2, BCM 9.0, Slurm 20.02.3. It
> > seems to have started to occur when I enabled proctrack/cgroup and changed
> > select/linear to select/cons_tres.
>
> Our slurm.conf has the same settings:
>
> SelectType=select/cons_tres
> SelectTypeParameters=CR_CPU
> SchedulerTimeSlice=60
> EnforcePartLimits=YES
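If it helps as a sanity check, 'scontrol show config' prints the parameters the running slurmctld actually loaded, so something along these lines (the grep pattern is just an example) shows the select and proctrack plugins currently in effect:

    scontrol show config | grep -iE 'selecttype|proctracktype'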
We enabled MPS too. Not sure if that's relevant.

> Are you using cgroup process tracking and have you manipulated the
> cgroup.conf file? Here's what we have in ours:
>
> CgroupMountpoint="/sys/fs/cgroup"
> CgroupAutomount=no
> AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
> TaskAffinity=no
> ConstrainCores=no
> ConstrainRAMSpace=no
> ConstrainSwapSpace=no
> ConstrainDevices=no
> ConstrainKmemSpace=yes
> AllowedRamSpace=100
> AllowedSwapSpace=0
> MinKmemSpace=30
> MaxKmemPercent=100
> MaxRAMPercent=100
> MaxSwapPercent=100
> MinRAMSpace=30
>
> Do jobs complete correctly when not cancelled?

Yes, they do, and cancelling a job doesn't always result in the node draining.

So would this be a Slurm issue or a Bright issue? For now I'm telling users to add 'sleep 60' as the last line in their sbatch files.
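In case it helps, the workaround looks like this in a job script (the job name and './my_app' are just placeholders):

    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --ntasks=1

    srun ./my_app
    # last line of the script: pause before the batch step exits
    # (the 'sleep 60' workaround mentioned above)
    sleep 60

When a node does end up drained, 'sinfo -R' lists the drain reason, and

    scontrol update NodeName=<nodename> State=RESUME

(with <nodename> replaced by the affected host) puts it back into service.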