How are you running the LXD container? Have you set up the device as a
passthrough to the container?
I have not run LXD containers under Slurm (I use Apptainer/Podman), but I
have used LXD VMs as nodes with no issues, even with GPUs/IB cards in them.
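For reference, a bare-bones GPU passthrough to an LXD container looks roughly
like the lines below (the container name is just a placeholder; check your LXD
version's docs for the MIG-specific gpu device options):

  # expose the host GPU to a container named "gpu-node" (placeholder name)
  lxc config device add gpu-node gpu0 gpu
  # newer LXD releases also have a MIG gputype, e.g.
  # lxc config device add gpu-node mig0 gpu gputype=mig mig.uuid=<MIG UUID>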
Brian Andrus
On 11/26/2025 8:32 AM, wk5ng--- via slurm-users wrote:
Hi all,
I'm having some trouble getting Slurm 24.11.6 to work with MIG, and the slurmd
logs seem to point to an issue with eBPF. For some context, this node is an
unprivileged LXD container where I'm trying to get MIG working with Slurm. Other
compute nodes without MIG work fine and isolate their GPUs correctly.
What I'm seeing in slurmd logs:
[2025-11-24T23:32:50.197] [331.interactive] cgroup/v2:
cgroup_p_constrain_apply: CGROUP: EBPF Closing and loading bpf program into
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_331
[2025-11-24T23:32:50.197] [331.interactive] error: load_ebpf_prog: BPF load
error (Operation not permitted). Please check your system limits (MEMLOCK).
I've tried increasing the system limit for MEMLOCK by setting
DefaultLimitMEMLOCK=infinity in /etc/systemd/system.conf, and I've copied my
slurmd.service file below, where I've set Delegate=yes and
LimitMEMLOCK=infinity. Previously Delegate=yes was not set (I found that setting
while going through Slurm's cgroup v2 documentation), but in both cases I see
the same BPF load error.
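For what it's worth, the effective limit on the running daemon can be
double-checked like this (just a sketch; it assumes the unit is named slurmd):

  # the limit systemd applied to the unit
  systemctl show slurmd -p LimitMEMLOCK
  # the limit the running slurmd process actually sees
  grep -i 'locked memory' /proc/$(pidof slurmd)/limits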
Just wondering whether other people have come across this before, or whether
I'm doing something silly here. I've checked that my slurm.conf has the
corresponding parameters set according to Slurm's own documentation for
cgroup.conf, and my cgroup.conf is also copied below.
Part of the gres.conf is also copied below. Even though I tried
AutoDetect=nvml for this node, it still didn't work, which is why I switched to
setting it manually based on the output of slurmd -G.
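For context, the device nodes used in it can be cross-checked against the
driver's view of the MIG instances (a sketch; minor numbers vary per system):

  # list GPUs and MIG instances with their UUIDs
  nvidia-smi -L
  # the MIG capability device nodes that gres.conf points at
  ls -l /dev/nvidia-caps/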
Maybe I should try switching back to cgroupv1 and see if that helps fix things,
but I'm not sure at this point if MIG and Slurm are compatible using cgroupv1.
I can send other parts of logs, configuration files etc. Any help would be
greatly appreciated!
###### slurmd.service file
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
Delegate=yes
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
###### cgroup.conf
CgroupPlugin=autodetect
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
###### gres.conf
NodeName=gpu-3 AutoDetect=nvml Name=gpu
NodeName=gpu-4 Name=gpu MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31