thanks for the hint.

so you also end up with two "slurmstepd infinity" processes, like I did when I tried this workaround?

[root@node ~]# ps aux | grep slurm
root        1833  0.0  0.0  33716  2188 ?        Ss   21:02   0:00 /usr/sbin/slurmstepd infinity
root        2259  0.0  0.0 236796 12108 ?        Ss   21:02   0:00 /usr/sbin/slurmd --systemd
root        2331  0.0  0.0  33716  1124 ?        S    21:02   0:00 /usr/sbin/slurmstepd infinity
root        2953  0.0  0.0 221944  1092 pts/0    S+   21:12   0:00 grep --color=auto slurm
[root@node ~]#
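
To tell the two apart, one can check which cgroup each one ended up in; a quick sketch, with the PIDs taken from the listing above:

[root@node ~]# cat /proc/1833/cgroup
[root@node ~]# cat /proc/2331/cgroup

I'd guess one was started by the oneshot workaround script and the other by slurmd itself.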

BTW, I found a mention of a change in the Slurm cgroup v2 code in the changelog for the next release:

https://github.com/SchedMD/slurm/blob/master/NEWS

There one can see the commit

https://github.com/SchedMD/slurm/commit/c21b48e724ec6f36d82c8efb1b81b6025ede240d

which refers to bug

https://bugs.schedmd.com/show_bug.cgi?id=19157

but as the bug is private, I cannot see its description.

So perhaps with the Slurm 24.xx release we'll see something new.

cheers

josef

On 11. 04. 24 19:53, Williams, Jenny Avis wrote:

There needs to be a slurmstepd infinity process running before slurmd starts.

This doc goes into it:
https://slurm.schedmd.com/cgroup_v2.html

There is probably a better way to do this, but this is what we do to deal with it:

::::::::::::::
files/slurm-cgrepair.service
::::::::::::::
[Unit]
Before=slurmd.service slurmctld.service
After=nas-longleaf.mount remote-fs.target system.slice

[Service]
Type=oneshot
ExecStart=/callback/slurm-cgrepair.sh

[Install]
WantedBy=default.target
::::::::::::::
files/slurm-cgrepair.sh
::::::::::::::
#!/bin/bash
# enable the cpu/cpuset/memory controllers at the root and within system.slice
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control
# make sure a slurmstepd infinity process exists before slurmd starts
/usr/sbin/slurmstepd infinity &
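
To wire this in, the two files get installed and the oneshot enabled; a minimal sketch (paths matching the unit above):

install -m 0644 files/slurm-cgrepair.service /etc/systemd/system/slurm-cgrepair.service
install -m 0755 files/slurm-cgrepair.sh /callback/slurm-cgrepair.sh
systemctl daemon-reload
systemctl enable slurm-cgrepair.service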

*From:* Josef Dvoracek via slurm-users <slurm-users@lists.schedmd.com>
*Sent:* Thursday, April 11, 2024 11:14 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: Slurmd enabled crash with CgroupV2

I observe the same behavior on Slurm 23.11.5 / Rocky Linux 8.9:

> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> memory pids
> [root@compute ~]# systemctl disable slurmd
> Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> cpuset cpu io memory pids
> [root@compute ~]# systemctl enable slurmd
> Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → /usr/lib/systemd/system/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> cpuset cpu io memory pids
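
For anyone checking their own node, the delegation state can be inspected directly; a sketch (only the unit name slurmd.service is assumed):

cat /sys/fs/cgroup/cgroup.controllers                  # controllers available at the root
cat /sys/fs/cgroup/system.slice/cgroup.subtree_control # controllers delegated within system.slice
systemctl show slurmd.service -p Delegate              # whether systemd delegates controllers to slurmd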

I see this thread is ~1 year old; has a better / newer understanding of this emerged over time?

cheers

josef

On 23. 05. 23 12:46, Alan Orth wrote:

    I notice the exact same behavior as Tristan. My CentOS Stream 8
    system is in full unified cgroupv2 mode, the slurmd.service has a
    "Delegate=Yes" override added to it, and all cgroup stuff is added
    to slurm.conf and cgroup.conf, yet slurmd does not start after
    reboot. I don't understand what is happening, but I see the exact
    same behavior regarding the cgroup subtree_control with disabling
    / re-enabling slurmd.
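
For reference, such a Delegate=Yes override is usually a systemd drop-in along these lines (a sketch; the drop-in path follows the usual systemd convention and is not taken from Alan's setup):

# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
Delegate=Yes

followed by a systemctl daemon-reload.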


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
