Hello
I am trying to debug an issue with EGL support (after updating the NVIDIA
drivers, eglGetDisplay and eglQueryDevicesEXT fail when they cannot access all
of the /dev/nvidia# devices inside a Slurm job), and I am wondering how Slurm
uses device cgroups so that I can reproduce the same cgroup setup by hand and
test the issue outside of Slurm.
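
For context, this is roughly what I was planning to try by hand, assuming the
cgroup v1 devices controller on RHEL 7; the cgroup name "egltest" and the
device majors/minors (NVIDIA character major 195) are just what I see on a
typical node, so adjust as needed:

    # as root: create a child cgroup under the cgroup v1 devices controller
    CG=/sys/fs/cgroup/devices/egltest
    mkdir "$CG"
    # drop the inherited "a *:* rwm" entry so the default becomes "deny"
    echo 'a' > "$CG/devices.deny"
    # basic nodes so the shell placed in the cgroup keeps working
    echo 'c 1:3 rwm' > "$CG/devices.allow"     # /dev/null
    echo 'c 1:5 rwm' > "$CG/devices.allow"     # /dev/zero
    echo 'c 1:9 rwm' > "$CG/devices.allow"     # /dev/urandom
    echo 'c 5:0 rwm' > "$CG/devices.allow"     # /dev/tty
    echo 'c 5:2 rwm' > "$CG/devices.allow"     # /dev/ptmx
    # one GPU plus the NVIDIA control device (char major 195 here)
    echo 'c 195:0 rw'   > "$CG/devices.allow"  # /dev/nvidia0
    echo 'c 195:255 rw' > "$CG/devices.allow"  # /dev/nvidiactl
    # nvidia-uvm / nvidia-modeset have a separate major; check /proc/devices
    # move this shell into the cgroup, then run nvidia-smi / the EGL test
    echo $$ > "$CG/cgroup.procs"

The deny-then-allow order is just the whitelist pattern from the kernel's
devices cgroup documentation; I do not know whether Slurm builds its list the
same way, which is part of what I am asking.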

We are running Slurm 20.02.05 with cgroups configured to constrain cores, RAM
space, and devices.
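
If the exact configuration matters, our cgroup.conf presumably boils down to
something like this (paraphrased, not the exact file; the AllowedDevicesFile
path is just the documented default):

    ### cgroup.conf (abridged)
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf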

When I get onto an allocation with 1 of the 4 GPUs on the system, nvidia-smi
only sees the GPU I was assigned, and I get "permission denied" when I try to
access the other /dev/nvidia# devices.
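
For what it's worth, a quick check along these lines is how I am testing
access from inside the allocation (dd with count=0 just opens and closes each
node):

    for d in /dev/nvidia[0-9]*; do
        if dd if="$d" of=/dev/null count=0 2>/dev/null; then
            echo "$d: open ok"
        else
            echo "$d: open refused"
        fi
    done
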
When I look at the cgroup, either under
/sys/fs/cgroup/slurm/uid_######/job_#########/step_0 or with the cgget
command, I see memory.limit_in_bytes and cpuset.cpus set as expected, but
devices.list is "a *:* rwm" even though I am being blocked from the other
devices. The device restriction does seem to be working, but the cgroup
parameters for it are not set the way I would expect.
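
For reference, these are the kinds of commands I am using to inspect it
(paths assume the usual per-controller cgroup v1 mounts on RHEL 7, with the
uid and job ID taken from the job's environment):

    # the device whitelist lives under the devices controller mount
    STEP=/sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/step_0
    cat "$STEP/devices.list"
    # the same path queried through libcgroup
    cgget -g devices "/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/step_0"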

How does Slurm manage the device cgroup settings on RHEL 7, so that I can
check them and mimic the setup by hand?

Thanks.
