How are you running the LXD container? Have you set up the device as a
passthrough to the container?
I have not run LXD containers under Slurm (I use Apptainer/Podman), but I
have used LXD VMs as nodes with no issues, even with GPUs/IB cards in them.
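For reference, a bare-bones GPU passthrough to an LXD container looks roughly
like the lines below (the container name is just a placeholder; check your LXD
version's docs for the MIG-specific gpu device options):

  # expose the host GPU to a container named "gpu-node" (placeholder name)
  lxc config device add gpu-node gpu0 gpu
  # newer LXD releases also have a MIG gputype, e.g.
  # lxc config device add gpu-node mig0 gpu gputype=mig mig.uuid=<MIG UUID>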
Brian Andrus
On 11/26/2025 8:32 AM, wk5ng--- via slurm-users wrote:
Hi all,
I'm having some trouble getting Slurm 24.11.6 to work with MIG, and the slurmd
logs seem to point to an issue with eBPF. For some context, this node is an
unprivileged LXD container where I'm trying to get MIG working with Slurm. Other
compute nodes without MIG work fine and isolate their GPUs correctly.
What I'm seeing in slurmd logs:
[2025-11-24T23:32:50.197] [331.interactive] cgroup/v2:
cgroup_p_constrain_apply: CGROUP: EBPF Closing and loading bpf program into
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_331
[2025-11-24T23:32:50.197] [331.interactive] error: load_ebpf_prog: BPF load
error (Operation not permitted). Please check your system limits (MEMLOCK).
I've tried increasing the system limit for MEMLOCK by setting
DefaultLimitMEMLOCK=infinity in /etc/systemd/system.conf, and I've copied my
slurmd.service file below, where I've set Delegate=yes and
LimitMEMLOCK=infinity. Previously Delegate=yes was not set (I found that setting
while going through Slurm's cgroup v2 documentation), but in both cases I see
the same BPF load error.
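For what it's worth, the effective limit on the running daemon can be
double-checked like this (just a sketch; it assumes the unit is named slurmd):

  # the limit systemd applied to the unit
  systemctl show slurmd -p LimitMEMLOCK
  # the limit the running slurmd process actually sees
  grep -i 'locked memory' /proc/$(pidof slurmd)/limits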
Just wondering whether other people have come across this before, or whether
I'm doing something silly here. I've checked that my slurm.conf has the
corresponding parameters set according to Slurm's own documentation for
cgroup.conf, and my cgroup.conf is also copied below.
Part of the gres.conf is also copied below. Even though I tried
AutoDetect=nvml for this node, it still didn't work, which is why I switched to
setting it manually based on the output of slurmd -G.
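For context, the device nodes used in it can be cross-checked against the
driver's view of the MIG instances (a sketch; minor numbers vary per system):

  # list GPUs and MIG instances with their UUIDs
  nvidia-smi -L
  # the MIG capability device nodes that gres.conf points at
  ls -l /dev/nvidia-caps/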
Maybe I should try switching back to cgroupv1 and see if that helps fix things,
but I'm not sure at this point if MIG and Slurm are compatible using cgroupv1.
I can send other parts of logs, configuration files etc. Any help would be
greatly appreciated!
###### slurmd.service file
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
Delegate=yes
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
###### cgroup.conf
CgroupPlugin=autodetect
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
###### gres.conf
NodeName=gpu-3 AutoDetect=nvml Name=gpu
NodeName=gpu-4 Name=gpu MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31