Slurm is great to use, and I've developed several plugins for it. Now I'm working on an issue in Slurm.
I'm using Slurm 15.08-11. After I enabled cgroup, tasks of some training jobs are killed after a few hours. I have reproduced this several times. After turning cgroup off, the problem disappears.

Linux kernel: 3.10.0-327.36.3.el7.x86_64
Slurm version: 15.08-11

Example of a killed job's log:

    srun: error ip-65: task 42: Killed
    sun: Terminating job step 10346.0
    slurmstepd: *** STEP 10346.0 ON ip-54 CANCELLED AT 2021-06-07T02:35:36 ***
    srun: error: ip-65: tasks 40,46 Killed
    srun: error: ip-65: tasks 45 Killed
    srun: error: ip-57: tasks 19-21 Killed

Job logs:

    $ sacct -j 10310646 --format=JobID,State,ExitCode,DerivedExitCode,start
           JobID      State ExitCode DerivedExitCode               Start
    ------------ ---------- -------- --------------- -------------------
    10310646      COMPLETED      0:9             0:0 2021-06-06T19:34:04

cgroup.conf (I only enabled ConstrainCores):

    AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
    CgroupAutomount=yes
    CgroupMountpoint=/sys/fs/cgroup
    ConstrainCores=yes
    ConstrainDevices=no
    #ConstrainKmemSpace=no #avoid known Kernel issues
    #ConstrainRAMSpace=yes
    #AllowedRAMSpace=80
    #ConstrainSwapSpace=yes
    TaskAffinity=no #use task/affinity plugin instead

Changes in slurm.conf to enable cgroup CPU constraints:

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

Could this be Slurm's or the OS's OOM killer? I checked the worker nodes' dmesg logs with grep -i 'killed process' /var/log/messages and grep -i 'oom' /var/log/messages, and found nothing.

Any clues about how to fix this?

PS: upgrading the Slurm version is almost impossible. I'm familiar with the Slurm code, so I want to fix this in Slurm 15.08.
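The log check described above can be wrapped in a small helper so it is easy to repeat on each worker node. This is only a sketch: the function name `scan_oom` is hypothetical, the default path `/var/log/messages` is taken from the greps in the post, and the two patterns are the same ones already searched for. Note that the bare `oom` pattern also matches unrelated words such as "room".

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): count lines in a syslog-style
# file that look like OOM-killer activity.
# Assumption: the log path defaults to /var/log/messages, as in the post.
scan_oom() {
    log_file="${1:-/var/log/messages}"
    # -i: ignore case, -c: print the number of matching lines,
    # -E: extended regex so both patterns can be checked in one pass.
    # grep exits non-zero when the count is 0; the count is still printed.
    grep -icE 'killed process|oom' "$log_file"
}
```

Running `scan_oom /var/log/messages` on a node prints the number of suspicious lines. A result of 0 is consistent with the greps finding nothing.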