Thanks for the advice. I checked munge's log on the system that was most recently affected and found a few hundred of these:
2022-08-16 23:30:56 +0300 Info: Unauthorized credential for client UID=0 GID=0

Not sure if relevant. NTP on the system is synced. I'll keep an eye on munge in the future...

Thanks again,

On Tue, Aug 16, 2022 at 1:45 PM Timony, Mick <michael_tim...@hms.harvard.edu> wrote:

> When I see odd behaviour I've found it sometimes related to either NTP
> issues (the time is off) or munge errors:
>
> - Is NTP running and is the time accurate
> - Look for munge errors:
>   - /var/log/munge/munged.log
>   - sudo systemctl status munge
>
> If it's a munge error, usually restarting munge does the trick:
>
> sudo systemctl restart munge
>
> Regards
> --Mick
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Alan Orth <alan.o...@gmail.com>
> *Sent:* Tuesday, August 16, 2022 4:36 PM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* Re: [slurm-users] Problems with cgroupsv2
>
> I re-installed SLURM 22.05.3 and then restarted slurmd and now it's
> working:
>
> # dnf reinstall slurm slurm-slurmd slurm-devel slurm-pam_slurm
> # systemctl restart slurmd
>
> The dnf.log shows that the versions were the same, so there was no
> mismatch or anything:
>
> 2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-22.05.3-1.el8.x86_64
> 2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-devel-22.05.3-1.el8.x86_64
> 2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-pam_slurm-22.05.3-1.el8.x86_64
> 2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-slurmd-22.05.3-1.el8.x86_64
>
> So I'm not sure what's going on... anyways, at least it's working now!
>
> Regards,
>
> On Tue, Aug 16, 2022 at 12:53 PM Alan Orth <alan.o...@gmail.com> wrote:
>
> Dear list,
>
> I've been using cgroupsv2 with SLURM 22.05 on CentOS Stream 8 successfully
> for a few months now. Recently a few of my nodes have started having
> problems starting slurmd. The log shows:
>
> [2022-08-16T20:52:58.439] slurmd version 22.05.3 started
> [2022-08-16T20:52:58.439] error: Controller cpuset is not enabled!
> [2022-08-16T20:52:58.439] error: Controller cpu is not enabled!
> [2022-08-16T20:52:58.439] error: cpu cgroup controller is not available.
> [2022-08-16T20:52:58.439] error: There's an issue initializing memory or cpu controller
> [2022-08-16T20:52:58.439] error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
> [2022-08-16T20:52:58.439] error: cannot create jobacct_gather context for jobacct_gather/cgroup
> [2022-08-16T20:52:58.439] fatal: Unable to initialize jobacct_gather
>
> The system has cgroupsv2 enabled as far as I can tell:
>
> # cat /sys/fs/cgroup/cgroup.controllers
> cpuset cpu io memory hugetlb pids rdma
> # [ $(stat -fc %T /sys/fs/cgroup/) = "cgroup2fs" ] && echo "unified" || ( [ -e /sys/fs/cgroup/unified/ ] && echo "hybrid" || echo "legacy")
> unified
>
> And my slurm.conf has:
>
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/affinity,task/cgroup
>
> And cgroup.conf:
>
> CgroupAutomount=yes
> CgroupPlugin=autodetect
>
> What else should I look for before giving up and reverting to cgroupsv1?
> My current version is 22.05.3, but it was happening in 22.05.2 as well.
>
> Thank you for any advice.

--
Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
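
For anyone hitting the same "Controller cpuset is not enabled!" errors, one more thing that may be worth checking: on cgroup v2, cgroup.controllers lists the controllers available to a cgroup, while cgroup.subtree_control lists the ones actually enabled for its children. The check quoted above only looks at the former. Below is a minimal sketch of a comparison between the two. The function name missing_controllers is hypothetical, and which cgroup directory matters for slurmd on a given node is an assumption you would need to confirm yourself; this is not taken from the Slurm documentation.

```shell
# Hypothetical helper (not part of Slurm): print the cgroup v2 controllers
# that are available in a cgroup directory but NOT enabled for its children.
# The default directory is an assumption; pass the cgroup you care about.
missing_controllers() {
    dir="${1:-/sys/fs/cgroup}"

    # Both files must exist for this to be a cgroup v2 directory.
    if [ ! -r "$dir/cgroup.controllers" ] || [ ! -r "$dir/cgroup.subtree_control" ]; then
        echo "not a readable cgroup v2 directory: $dir" >&2
        return 1
    fi

    available=$(cat "$dir/cgroup.controllers")
    enabled=$(cat "$dir/cgroup.subtree_control")

    for c in $available; do
        # Pad with spaces so that "cpu" does not match inside "cpuset".
        case " $enabled " in
            *" $c "*) ;;                 # enabled for children, nothing to report
            *) printf '%s\n' "$c" ;;     # available but not enabled
        esac
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# missing_controllers /sys/fs/cgroup
```

If cpuset or cpu show up in the output for the cgroup that slurmd runs under, that might explain the errors, though I have not confirmed this is what happened on my nodes.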