On 7/2/21 7:34 AM, Jack Chen wrote:
Slurm is great to use, and I've developed several plugins for it. Now I'm working on an issue in Slurm.

I'm using Slurm 15.08-11. After I enabled cgroup support, tasks of some training jobs are killed after a few hours. I have reproduced this several times. After turning cgroup support off, the problem disappears.

Linux kernel: 3.10.0-327.36.3.el7.x86_64

Slurm version: 15.08-11

For cgroup support I believe you need to upgrade to a much more recent Slurm version, probably Slurm 17.02.5 or later. See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#cgroup-configuration

PS: Upgrading the Slurm version is almost impossible. I'm familiar with the Slurm code, so I want to fix it in Slurm 15.08.

IMHO, you will suffer many problems if you stick with this old 15.08 release. It is definitely feasible to upgrade Slurm, although you have to take great care with the database upgrade if upgrading from 17.02 or older. Upgrading between recent versions is quite straightforward, but it is imperative that you upgrade by at most 2 major releases at a time!

I have collected upgrading experience and documentation here:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

Best regards,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
