Hi,

Thanks a lot Yair, ConstrainDevices=yes solves the problem partially.
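(For reference, a minimal cgroup.conf sketch with device constraining enabled could look like the lines below; apart from ConstrainDevices=yes, the other settings are only illustrative defaults, not necessarily what we run:)

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes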
With the node definition

NodeName=cudanode[1-20] Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=384000 Gres=gpu:2

I get exactly the number of GPUs that I request. But with

NodeName=cudanode[1-20] Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=384000 Gres=gpu:no_consume:2

it was not possible to run any job with a GPU request. In the slurmd log I get:

[2018-03-12T14:21:30.385] [114097.0] debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
[2018-03-12T14:21:30.385] [114097.0] debug3: gres_device_major : /dev/nvidia1 major 195, minor 1
...
[2018-03-12T14:21:30.392] [114097.0] debug: Not allowing access to device c 195:0 rwm(/dev/nvidia0) for job
[2018-03-12T14:21:30.392] [114097.0] debug3: xcgroup_set_param: parameter 'devices.deny' set to 'c 195:0 rwm' for '/sys/fs/cgroup/devices/slurm/uid_1487/job_114097'
[2018-03-12T14:21:30.392] [114097.0] debug: Not allowing access to device c 195:1 rwm(/dev/nvidia1) for job
[2018-03-12T14:21:30.392] [114097.0] debug3: xcgroup_set_param: parameter 'devices.deny' set to 'c 195:1 rwm' for '/sys/fs/cgroup/devices/slurm/uid_1487/job_114097'
...

I had read some posts about ConstrainDevices and some bug fixes in slurm-17.11.0, so I upgraded Slurm to 17.11.04; the log above is from 17.11.04.

Regards,
Sefa ARSLAN

> Hi,
>
> This is just a guess, but there's also a cgroup.conf file where you
> might need to add:
>
> ConstrainDevices=yes
>
> see:
> https://slurm.schedmd.com/cgroup.conf.html
>
> for more details.
>
> HTH,
> Yair.
>
> On Mon, Mar 12 2018, Sefa Arslan <sefa.ars...@tubitak.gov.tr> wrote:
>
>> Dear all,
>>
>> We have upgraded our cluster from 13 to Slurm 17.11 and have some problems
>> with the GPU configuration. Although I request no GPUs, the system lets me
>> use the GPU cards.
>>
>> Let me explain.
>>
>> slurm.conf:
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> TaskPlugin=task/cgroup
>> PreemptType=preempt/none
>>
>> NodeName=cudanode[1-20] Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=384000 Gres=gpu:2
>> PartitionName=cuda Nodes=cudanode[1-20] Default=no MaxTime=15-00:00:00 defaulttime=00:02:00 State=UP DefMemPerCPU=8500 MaxMemPerNode=380000 Shared=NO Priority=1000
>>
>> gres.conf:
>> Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
>> Name=gpu File=/dev/nvidia1 CPUs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
>>
>> I am testing the configuration with the deviceQuery app that comes with the CUDA 9 package.
>>
>> When I send a job with 2 GPUs, the system reserves the right number of GPUs:
>>
>> srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:2 ./cuda.sh
>> CUDA_VISIBLE_DEVICES: 0,1
>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB, Device1 = Tesla P100-PCIE-16GB
>> Result = PASS
>>
>> When I send a job with 1 GPU, the system also reserves the right number of GPUs:
>>
>> srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:1 ./cuda.sh
>> CUDA_VISIBLE_DEVICES: 0
>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla P100-PCIE-16GB
>> Result = PASS
>>
>> But when I send a job without any GPUs, the system also lets me use the GPUs, which I do not expect:
>> srun -n 1 -p cuda --nodelist=cudanode1 ./cuda.sh
>> CUDA_VISIBLE_DEVICES:
>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB, Device1 = Tesla P100-PCIE-16GB
>> Result = PASS
>>
>> This way I am able to run 40 jobs on one server at the same time, all of which use the GPUs. Is this a bug, or did I miss something? With previous versions of Slurm, GPU allocation worked as I expected. I also tried with CUDA-enabled NAMD, which uses higher-level hardware access methods, and I get the same result.
>>
>> Another problem I hit: when I change the GPU configuration from Gres=gpu:2 to Gres=gpu:no_consume:2, so that the GPUs can be used by many jobs simultaneously, the system lets me use all the cards regardless of how many cards I request.
>>
>> Regards,
>> Sefa ARSLAN
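(As a quick sanity check once ConstrainDevices=yes is active, something along these lines should show whether the device cgroup is really enforced; it assumes nvidia-smi is installed on the compute nodes and reuses the partition and node names from above:)

# Job requesting one GPU: should list exactly one device.
srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:1 nvidia-smi -L

# Job requesting no GPU: with the device cgroup enforced,
# nvidia-smi should fail or list no devices.
srun -n 1 -p cuda --nodelist=cudanode1 nvidia-smi -L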