Thanks Kevin!  Indeed, nvidia-smi in an interactive job shows me that I can access the devices even when I should not be able to (a quick sketch of the checks I am running is below).
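For concreteness, this is roughly what I am doing (the partition name is just a placeholder for ours):

    # no GPUs requested -- I would expect "No devices were found"
    salloc -p gpu --gres=gpu:0 srun --pty bash
    nvidia-smi            # ...but both GPUs are listed

    # one GPU requested, then hide it from the CUDA runtime
    salloc -p gpu --gres=gpu:1 srun --pty bash
    unset CUDA_VISIBLE_DEVICES
    nvidia-smi            # with cgroup confinement only the allocated GPU should appear; I see both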
I thought including the /dev/nvidia* would whitelist those devices ... which seems to be the opposite of what I want, no?  Or do I misunderstand?

Thanks,
Paul

On Tue, May 1, 2018, 19:00 Kevin Manalo <kman...@jhu.edu> wrote:

> Paul,
>
> Having recently set this up, this was my test: when you make a single GPU
> request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty
> bash), you should only see the GPU assigned to you via 'nvidia-smi'.
>
> When gres is unset you should see
>
> nvidia-smi
> No devices were found
>
> Otherwise, if you ask for 1 of 2, you should only see 1 device.
>
> Also, I recall appending this to the bottom of
>
> [cgroup_allowed_devices_file.conf]
> ..
> Same as yours
> ...
> /dev/nvidia*
>
> There was a SLURM bug issue that made this clear, not so much in the
> website docs.
>
> -Kevin
>
>
> On 5/1/18, 5:28 PM, "slurm-users on behalf of R. Paul Wiegand"
> <slurm-users-boun...@lists.schedmd.com on behalf of rpwieg...@gmail.com>
> wrote:
>
>     Greetings,
>
>     I am setting up our new GPU cluster, and I seem to have a problem
>     configuring things so that the devices are properly walled off via
>     cgroups.  Our nodes each have two GPUs; however, if --gres is unset,
>     or set to --gres=gpu:0, I can access both GPUs from inside a job.
>     Moreover, if I ask for just 1 GPU and then unset the
>     CUDA_VISIBLE_DEVICES environment variable, I can access both GPUs.
>     From my understanding, this suggests that they are *not* being
>     protected under cgroups.
>
>     I've read the documentation, and I've read through a number of
>     threads where people have resolved similar issues.  I've tried a lot
>     of configurations, but to no avail.  Below I include some snippets of
>     the relevant (current) parameters; I am also attaching most of our
>     full conf files.
>
>     [slurm.conf]
>     ProctrackType=proctrack/cgroup
>     TaskPlugin=task/cgroup
>     SelectType=select/cons_res
>     SelectTypeParameters=CR_Core_Memory
>     JobAcctGatherType=jobacct_gather/linux
>     AccountingStorageTRES=gres/gpu
>     GresTypes=gpu
>
>     NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2
>
>     [gres.conf]
>     NodeName=evc[1-10] Name=gpu File=/dev/nvidia0 COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
>     NodeName=evc[1-10] Name=gpu File=/dev/nvidia1 COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
>
>     [cgroup.conf]
>     ConstrainDevices=yes
>
>     [cgroup_allowed_devices_file.conf]
>     /dev/null
>     /dev/urandom
>     /dev/zero
>     /dev/sda*
>     /dev/cpu/*/*
>     /dev/pts/*
>
>     Thanks,
>     Paul.
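P.S. In case it helps anyone else reading the archive: one way to see whether the devices cgroup is actually being applied is to inspect the job's cgroup on the compute node.  A rough sketch (this assumes the cgroup v1 hierarchy SLURM uses here; the uid/job ids below are only illustrative):

    # from inside the job on the compute node
    grep devices /proc/self/cgroup
    cat /sys/fs/cgroup/devices/slurm/uid_1234/job_5678/step_0/devices.list
    # with ConstrainDevices=yes and one GPU granted, the allowed list should
    # not include the other GPU's character device (NVIDIA devices are major 195)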