So my guess is that the correct line could be something like
Name=gpu Type=a100 MultipleFiles=/dev/nvidia0,/dev/dri/card1,/dev/dri/renderD128
(our machines also have an integrated GPU, which creates /dev/dri/card0 but no renderD device; that's why I assign card1 to the 0th nvidia GPU)
I'll try to make a test setup using this and report how it works. Most importantly, it would be essential to know whether the card* and renderD* device names are also assigned in PCI order (hope so!), and whether cgroups handle these devices correctly. There would also be the question of how to report which card* and renderD* devices the user can use in a job, but if they can be derived from SLURM_STEP_GPUS, it wouldn't be difficult to provide a userspace script that generates the list of usable devices, along the lines of the sketch below.
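Something like the following could do it (untested Python sketch; it assumes SLURM_STEP_GPUS holds the allocated nvidia device minor numbers and that the /proc/driver/nvidia and /sys/class/drm paths below look the same on your nodes, which should be checked). Matching by PCI bus ID would also avoid having to rely on card*/renderD* being numbered in the same order as the nvidia devices:

#!/usr/bin/env python3
# Hypothetical helper: map the NVIDIA device minors in SLURM_STEP_GPUS to the
# matching /dev/dri/card* and /dev/dri/renderD* nodes by comparing PCI bus IDs.
# Assumes the nvidia driver exposes /proc/driver/nvidia/gpus/<pci>/information
# with a "Device Minor:" line, and that /sys/class/drm/<node>/device points at
# the PCI device -- verify both on the actual nodes.
import glob
import os
import re

def nvidia_minor_to_pci():
    """Return {device minor: PCI bus id} for the NVIDIA GPUs on this node."""
    mapping = {}
    for info in glob.glob('/proc/driver/nvidia/gpus/*/information'):
        pci = os.path.basename(os.path.dirname(info)).lower()  # e.g. 0000:81:00.0
        with open(info) as f:
            m = re.search(r'Device Minor:\s*(\d+)', f.read())
        if m:
            mapping[int(m.group(1))] = pci
    return mapping

def drm_nodes_by_pci():
    """Return {PCI bus id: [/dev/dri/... paths]} built from sysfs."""
    mapping = {}
    for node in glob.glob('/sys/class/drm/*'):
        name = os.path.basename(node)
        if not re.fullmatch(r'card\d+|renderD\d+', name):
            continue  # skip connector entries such as card0-DP-1
        pci = os.path.basename(os.path.realpath(os.path.join(node, 'device'))).lower()
        mapping.setdefault(pci, []).append('/dev/dri/' + name)
    return mapping

if __name__ == '__main__':
    # SLURM_STEP_GPUS is assumed to contain the allocated nvidia minors, e.g. "0,2";
    # print the DRM nodes that belong to the same physical GPUs.
    minors = [int(x) for x in os.environ.get('SLURM_STEP_GPUS', '').split(',') if x]
    nvidia = nvidia_minor_to_pci()
    drm = drm_nodes_by_pci()
    for minor in minors:
        for dev in sorted(drm.get(nvidia.get(minor, ''), [])):
            print(dev)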
Yes, virtualgl+xvfb or virtualgl+turbovnc is exactly the use case I have in mind (a rough sketch of the pattern is below). We had this working on a headless non-Slurm server without many problems, running a robotics simulator with rendering sensors, and sometimes even with a GUI.

> Is your goal to enable VirtualGL for jobs? If it is, I tried packing it, its dependencies, a minimal X11 server and turbovnc into a singularity image which can be used in a job. This worked as a proof of concept for glxgears, but not for the software users wanted to run.
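For reference, the kind of invocation I have in mind, whether inside such an image or directly on the node, is roughly this (untested Python sketch; the display number, render node and glxgears are just placeholders, and pointing vglrun at a DRM node with -d assumes VirtualGL's EGL back end, if I remember correctly):

#!/usr/bin/env python3
# Rough sketch of the virtualgl+xvfb pattern inside a job step: start a headless
# X server on a (hopefully) free display and run the OpenGL application through
# vglrun so the 3D rendering goes to the allocated GPU.
import os
import subprocess
import time

DISPLAY = ':99'                      # assumed unused display number
RENDER_NODE = '/dev/dri/renderD128'  # would come from the mapping script above

xvfb = subprocess.Popen(['Xvfb', DISPLAY, '-screen', '0', '1280x1024x24'])
time.sleep(2)  # crude wait for Xvfb to come up

env = dict(os.environ, DISPLAY=DISPLAY)
try:
    subprocess.run(['vglrun', '-d', RENDER_NODE, 'glxgears'], env=env, check=True)
finally:
    xvfb.terminate()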
Eventually this might work with Vulkan instead of OpenGL. The software in question would have to be updated, too, and the GPU drivers would have to support the needed Vulkan features as well.
No idea which devices Vulkan uses. Are they also the DRM devices?

Martin