> > been having the same issue with BCM, CentOS 8.2, BCM 9.0, Slurm 20.02.3. It
> > seems to have started to occur when I enabled proctrack/cgroup and changed
> > select/linear to select/cons_tres.
>
> Our slurm.conf has the same settings:
>
> SelectType=select/cons_tres
> SelectTypeParameters=CR_CPU
> SchedulerTimeSlice=60
> EnforcePartLimits=YES
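If it helps as a sanity check, 'scontrol show config' prints the parameters the running slurmctld actually loaded, so something along these lines (the grep pattern is just an example) shows the select and proctrack plugins currently in effect:

    scontrol show config | grep -iE 'selecttype|proctracktype'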
We enabled MPS too. Not sure if that's relevant.

> Are you using cgroup process tracking and have you manipulated the
> cgroup.conf file? Here's what we have in ours:
>
> CgroupMountpoint="/sys/fs/cgroup"
> CgroupAutomount=no
> AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
> TaskAffinity=no
> ConstrainCores=no
> ConstrainRAMSpace=no
> ConstrainSwapSpace=no
> ConstrainDevices=no
> ConstrainKmemSpace=yes
> AllowedRamSpace=100
> AllowedSwapSpace=0
> MinKmemSpace=30
> MaxKmemPercent=100
> MaxRAMPercent=100
> MaxSwapPercent=100
> MinRAMSpace=30
>
> Do jobs complete correctly when not cancelled?

Yes, they do, and cancelling a job doesn't always result in the node draining.

So would this be a Slurm issue or a Bright issue? For now I'm telling users to add 'sleep 60' as the last line in their sbatch files.
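In case it helps, the workaround looks like this in a job script (the job name and './my_app' are just placeholders):

    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --ntasks=1

    srun ./my_app
    # last line of the script: pause before the batch step exits
    # (the 'sleep 60' workaround mentioned above)
    sleep 60

When a node does end up drained, 'sinfo -R' lists the drain reason, and

    scontrol update NodeName=<nodename> State=RESUME

(with <nodename> replaced by the affected host) puts it back into service.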