Hi Tim and community,

We are having the same issue (cgroup confinement not working, it seems; jobs see all GPUs) on a GPU compute node (DGX A100). It started a couple of days ago after a full update (apt upgrade). Now, whenever we launch a job on that partition, we get the error message Tim mentioned. As a note, we have another custom GPU compute node with L40s, on a different partition, and that one works fine. Even before this error we always had small differences in kernel versions between nodes, so I am not sure whether that is the problem.
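For reference, this is how we are checking whether device confinement works (a quick sketch; <partition> is a placeholder for the partition name, and it assumes GPUs are configured as a "gpu" GRES):

    # Request a single GPU; with working cgroup confinement,
    # only the allocated GPU should be listed.
    srun --partition=<partition> --gres=gpu:1 nvidia-smi -L

This is how we observe the symptom: on the DGX A100 the job lists every GPU, while on the L40s node it lists only the one requested.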
In any case, here is the kernel info of our nodes as well.

*[Problem node]* The DGX A100 node has this kernel:

cnavarro@nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

*[Functioning node]* The custom GPU node (L40s) has this kernel:

cnavarro@nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

*[Login node]* And the login node (slurmctld):

➜ ~ uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Any ideas what we should check?

On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneid...@tu-darmstadt.de> wrote:

> Hi,
>
> I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled
> two of our nodes, I get the following error when launching a job:
>
> slurmstepd: error: load_ebpf_prog: BPF load error (No space left on
> device). Please check your system limits (MEMLOCK).
>
> Also the cgroups do not seem to work properly anymore, as I am able to
> see all GPUs even if I do not request them, which is not the case on the
> other nodes.
>
> One difference I found between the updated nodes and the original nodes
> (both are Ubuntu 22.04) is the kernel version, which is
> "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and
> "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could not
> figure out how to install the exact first kernel version on the updated
> nodes, but I noticed that when I reinstall 5.15.0 with this tool:
> https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message
> disappears. However, once I do that, the network driver does not
> function properly anymore, so this does not seem to be a good solution.
>
> Has anyone seen this issue before or is there maybe something else I
> should take a look at? I am also happy to just find a workaround such
> that I can take these nodes back online.
>
> I appreciate any help!
>
> Thanks a lot in advance and best wishes,
>
> Tim

--
Cristóbal A. Navarro
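PS: Regarding the MEMLOCK hint in the error message, here is a quick way we compare the limits between the working and broken nodes (a sketch; it assumes slurmd runs as a systemd service, as on our Ubuntu nodes):

    # Effective locked-memory limit of the running slurmd process
    grep 'locked memory' /proc/$(pidof slurmd)/limits

    # Limit granted by the systemd unit; 'infinity' would rule MEMLOCK out
    systemctl show slurmd -p LimitMEMLOCK

If slurmd were started with a low MEMLOCK limit, loading the eBPF device-control program could fail like this, so it seems worth ruling out before blaming the kernel update.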