Maybe I have good news, Stephan (and others). I discovered that SLURM 20.11 added a MultipleFiles option to gres.conf, which can be used in place of File= when a single GRES spans several device files. There are no docs about it yet, but I found a (possibly) working snippet that makes use of this option here: https://bugs.schedmd.com/show_bug.cgi?id=11091#c13

So my guess is that the correct line could be something like

    Name=gpu Type=a100 MultipleFiles=/dev/nvidia0,/dev/dri/card1,/dev/dri/renderD128

(Our machines also have an integrated GPU, which creates /dev/dri/card0 but no renderD device; that's why I assign card1 to the 0th NVIDIA GPU.)
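
Assuming the same pattern continues for the remaining GPUs (this is just my guess for a hypothetical two-GPU node; the actual device numbers would have to be checked, e.g. against the /dev/dri/by-path symlinks), a full gres.conf could look roughly like:

    Name=gpu Type=a100 MultipleFiles=/dev/nvidia0,/dev/dri/card1,/dev/dri/renderD128
    Name=gpu Type=a100 MultipleFiles=/dev/nvidia1,/dev/dri/card2,/dev/dri/renderD129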

I'll try to build a test setup using this and report how it works. Most importantly, it would be essential to know whether the card* and renderD* device names are also assigned in PCI order (I hope so!), and whether cgroups handle these devices correctly. There is also the question of how to tell the user which card* and renderD* devices they may use in a job, but if they can be derived from SLURM_STEP_GPUS, it wouldn't be difficult to provide a userspace script that generates the list of usable devices.
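
For illustration, if the mapping really is "NVIDIA GPU N -> card(N+1) and renderD(128+N)" (which is exactly the assumption that still needs to be verified), such a helper could be as simple as:

    #!/bin/bash
    # Print the device nodes belonging to the GPUs allocated to this step.
    # SLURM_STEP_GPUS is a comma-separated list of GPU indices, e.g. "0,2".
    # Assumes card0 is the integrated GPU, so NVIDIA GPU N owns card(N+1)
    # and renderD(128+N) -- this mapping is a guess, not verified yet.
    for idx in ${SLURM_STEP_GPUS//,/ }; do
        echo "/dev/nvidia${idx} /dev/dri/card$((idx + 1)) /dev/dri/renderD$((128 + idx))"
    done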

> Is your goal to enable VirtualGL for jobs? If it is, I tried a solution
> that packs it, its dependencies, a minimal X11 server and TurboVNC
> into a Singularity image which can be used in a job.
> This worked as a proof of concept for glxgears, but not for the software
> users wanted to run.

Yes, VirtualGL+Xvfb or VirtualGL+TurboVNC is exactly the use case I have in mind. We had this working without much trouble on a headless non-Slurm server, running a robotics simulator with rendering sensors, and sometimes even with a GUI; a rough sketch of such a job step is below the quoted text.

> Eventually this might work with Vulkan instead of OpenGL. The software in
> question would have to be updated, too, and the GPU drivers would have to
> support the needed Vulkan features as well.

No idea which devices Vulkan uses. Are they also the DRM devices?
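
For reference, here is a rough, untested sketch of the VirtualGL+Xvfb job step I have in mind, assuming VirtualGL 3.x with the EGL back end (where -d can point at a DRM device node) and assuming the gres.conf/cgroup setup above really hands the job /dev/dri/card1:

    # inside an salloc/sbatch allocation with --gres=gpu:1
    Xvfb :99 -screen 0 1920x1080x24 &    # headless 2D X server
    export DISPLAY=:99
    vglrun -d /dev/dri/card1 glxgears    # off-screen 3D rendering on the allocated GPU

The device path passed to -d would ideally come from the helper script above rather than being hard-coded.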

Martin

