[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users

I observe the same behavior on Slurm 23.11.5 on Rocky Linux 8.9.

> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> memory pids
> [root@compute ~]# systemctl disable slurmd
> Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> cpuset cpu io memory pids
> [root@compute ~]# systemctl enable slurmd
> Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → /usr/lib/systemd/system/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> cpuset cpu io memory pids

Over time (I see this thread is ~1 year old), has any better or newer understanding of this emerged?


cheers

josef


On 23. 05. 23 12:46, Alan Orth wrote:
I notice the exact same behavior as Tristan. My CentOS Stream 8 system 
is in full unified cgroupv2 mode, the slurmd.service has a 
"Delegate=Yes" override added to it, and all cgroup stuff is added to 
slurm.conf and cgroup.conf, yet slurmd does not start after reboot. I 
don't understand what is happening, but I see the exact same behavior 
regarding the cgroup subtree_control with disabling / re-enabling slurmd.





-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
There needs to be a slurmstepd infinity process running before slurmd starts.
This doc goes into it:
https://slurm.schedmd.com/cgroup_v2.html

There is probably a better way to do this, but this is what we do to deal with it:

::
files/slurm-cgrepair.service
::
[Unit]
Before=slurmd.service slurmctld.service
After=nas-longleaf.mount remote-fs.target system.slice

[Service]
Type=oneshot
ExecStart=/callback/slurm-cgrepair.sh

[Install]
WantedBy=default.target
::
files/slurm-cgrepair.sh
::
#!/bin/bash
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control

/usr/sbin/slurmstepd infinity &
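
For reference, wiring the unit in by hand could look roughly like the following sketch. The files/ layout and the /callback/ install path are simply taken from the snippets above; adapt them to whatever deployment tooling you normally use.

install -m 0755 files/slurm-cgrepair.sh /callback/slurm-cgrepair.sh      # path must match ExecStart= above
install -m 0644 files/slurm-cgrepair.service /etc/systemd/system/slurm-cgrepair.service
systemctl daemon-reload
systemctl enable slurm-cgrepair.service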






[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users

Thanks for the hint.

So you end up with two "slurmstepd infinity" processes, like I did when I tried this workaround?


[root@node ~]# ps aux | grep slurm
root    1833  0.0  0.0  33716  2188 ?    Ss   21:02   0:00 /usr/sbin/slurmstepd infinity
root    2259  0.0  0.0 236796 12108 ?    Ss   21:02   0:00 /usr/sbin/slurmd --systemd
root    2331  0.0  0.0  33716  1124 ?    S    21:02   0:00 /usr/sbin/slurmstepd infinity
root    2953  0.0  0.0 221944  1092 pts/0    S+   21:12   0:00 grep --color=auto slurm

[root@node ~]#

BTW, I found a mention of a change in the Slurm cgroup/v2 code in the changelog for the next Slurm release:


https://github.com/SchedMD/slurm/blob/master/NEWS

One can see the commit here:

https://github.com/SchedMD/slurm/commit/c21b48e724ec6f36d82c8efb1b81b6025ede240d

referring to bug

https://bugs.schedmd.com/show_bug.cgi?id=19157

but as the bug is private, I cannot see its description.

So perhaps with the Slurm 24.xx release we'll see something new.

cheers

josef










[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
The end goal is to see the following two things:
jobs under the slurmstepd cgroup path, and
at least cpu, cpuset, and memory in the cgroup.controllers file of the job's cgroup.
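
A quick way to check both, assuming the slurmstepd.scope location that the cgroup/v2 plugin documentation describes (the job directory name here is just a placeholder):

# job cgroups created under the slurmstepd scope
ls -d /sys/fs/cgroup/system.slice/slurmstepd.scope/job_*
# controllers available inside one of those job cgroups
cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345/cgroup.controllers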

The pattern you have would be the processes left over after boot: the first slurmd service start fails and leaves a slurmstepd infinity process behind, and then the second slurmd start succeeds. In your case that leaves a second slurmstepd infinity process. As to why exactly, I can't answer that one sitting here without poking at it more.


Having that slurmstepd infinity running with the needed cgroup controllers (for us, at a minimum cpuset, cpu, and memory; YMMV depending on your cgroup.conf settings) before slurmd tries to start is what enables slurmd to start.
The necessary piece to this working is that the required controllers are available at the parent of the cgroup path before slurmd, and in particular slurmstepd infinity, starts.
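
A minimal check of that parent path, using the same two subtree_control files the snippet further down appends to:

cat /sys/fs/cgroup/cgroup.subtree_control
cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
# before slurmd starts, both should already list at least: cpuset cpu memory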

Our cgroup.conf file is:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
CgroupPlugin=cgroup/v2
AllowedSwapSpace=1
ConstrainDevices=yes
ConstrainSwapSpace=yes

So the missing piece that keeps slurmd from starting at boot is corrected by applying these changes to the cgroup controllers before the slurmd service attempts to start.  As a test, on your system as it is now, without adding anything I've mentioned, try a cgroup.conf with zero Constrain statements.  My bet is that slurmd then starts cleanly on boot.  I hope the bug fix does not make slurmd more liberal about checking the cgroup controller list; it took a while before I trusted that the controllers were actually there, so knowing that if slurmd starts the controllers are there is great.
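
For that test, a stripped-down cgroup.conf could look like this (a sketch for the diagnostic only, not a recommended production configuration):

CgroupPlugin=cgroup/v2
# no Constrain* lines, so slurmd should not need the cpu/cpuset/memory controllers at startup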

/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control



The job cgroup propagation (the contents of the cgroup.controllers files along the cgroup path) after slurmd and slurmstepd infinity start happens via the cgroup path established under slurmstepd.scope.  If there is no slurmstepd infinity, Slurm will start one; if slurmstepd infinity is running and it sets up at minimum the cgroup controllers slurmd needs based on what is in cgroup.conf, then slurmd doesn't end up starting more slurmstepd infinity processes.  My recollection is that the first slurmstepd infinity does set up the needed cgroup controllers, which is why a second slurmd attempt then starts.
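
One way to watch that, assuming the usual scope location under system.slice (the systemd-cgls call is optional, the plain cat works too):

# process tree Slurm builds under its scope
systemd-cgls /system.slice/slurmstepd.scope
# controllers the scope can pass down to the job cgroups beneath it
cat /sys/fs/cgroup/system.slice/slurmstepd.scope/cgroup.controllers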

To see slurmd complaining about the specifics, try disabling the slurmd service, rebooting, setting SLURM_DEBUG_FLAGS=cgroups, and then running slurmd -D -vvv manually.  I am fairly sure that helps you see the particulars better.
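
Spelled out as commands (the debug-flag name and value are as given above; check the documentation for your Slurm version):

systemctl disable slurmd
reboot
# after the node comes back up:
export SLURM_DEBUG_FLAGS=cgroups
slurmd -D -vvv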

Theoretically, in our setup with the slurm-cgrepair.service, we force a slurmstepd infinity process to be running prior to the slurmd service finishing (IDK, the PID order says otherwise):
# systemctl show slurmd | egrep cgrepair
After=network-online.target systemd-journald.socket slurm-cgrepair.service remote-fs.target system.slice sysinit.target nvidia.service munge.service basic.target

The resulting behavior of this setup is as we expect: the slurmd service is running on nodes after reboot without intervention.  Our steps may not all be necessary, but they are sufficient.

The list of cgroup controllers for processes further down the cgroup path (cpu, cpuset, memory for slurmstepd.scope/job_) can only be a subset of what any parent in the cgroup path has (cpu, cpuset, memory, pids for slurmstepd.scope).

You asked in the context of what our process tree looks like; here is that information.  I add the cgroup field in top for ongoing assurance that user processes are under the slurmstepd.scope path.

This is the process tree on our nodes.

# ps aux |grep slurm |head -n 15 |sed 's//aUser/g'
root     8687  0.0  0.0 6471088 34044 ?  Ss   Apr03   0:29 /usr/sbin/slurmd -D -s
root     8694  0.0  0.0   33668  1080 ?  S    Apr03   0:00 /usr/sbin/slurmstepd infinity
root  2942928  0.0  0.0  311804  7416 ?  Sl   Apr06   0:42 slurmstepd: [35400562.extern]
root  2942930  0.0  0.0  311804  7164 ?  Sl   Apr06   0:43 slurmstepd: [35400563.extern]
root  2942933  0.0  0.0  311804  7144 ?  Sl   Apr06   0:45 slurmstepd: [35400564.extern]
root  2942935  0.0  0.0  311804  7280 ?  Sl   Apr06   0:38 slurmstepd: [35400565.extern]
root  2942953  0.0  0.0  312164  7496 ?  Sl   Apr06   0:45 slurmstepd: [35400564.batch]
root  2942958  0.0  0.0  312164  7620 ?  Sl   Apr06   0:41 slurmstepd: [35400562.batch]
root  2942960  0.0  0.0  312164  7636 ?  Sl   Apr06   0:43 slurmstepd: [35400563.batch]
root  2942962  0.0  0.0  312164  7728 ?  Sl   Apr06   0:41 slurmstepd: [35400565.batch]
aUser 2942972  0.0  0.0   12868  3072 ?  SN   Apr06   0:00 /bin/bash /var/spool/slurmd/job35400562/slurm_script
aUser 2942973  0.0  0.0   12868  2868 ?  SN   Apr06   0:00 /bin/bash /var/spool/slu