Dear all,

We have upgraded our cluster from Slurm 13 to Slurm 17.11 and have a problem with the GPU configuration: although I request no GPUs, the system still lets my job use the GPU cards.
Let me explain.

slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/cgroup
PreemptType=preempt/none
NodeName=cudanode[1-20] Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=384000 Gres=gpu:2
PartitionName=cuda Nodes=cudanode[1-20] Default=no MaxTime=15-00:00:00 DefaultTime=00:02:00 State=UP DefMemPerCPU=8500 MaxMemPerNode=380000 Shared=NO Priority=1000

gres.conf:

Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
Name=gpu File=/dev/nvidia1 CPUs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79

I am testing the configuration with the deviceQuery app that comes with the CUDA 9 package.

When I submit a job with 2 GPUs, the system reserves the right number of GPUs:

srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:2 ./cuda.sh
CUDA_VISIBLE_DEVICES: 0,1
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB, Device1 = Tesla P100-PCIE-16GB
Result = PASS

When I submit a job with 1 GPU, the system also reserves the right number of GPUs:

srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:1 ./cuda.sh
CUDA_VISIBLE_DEVICES: 0
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla P100-PCIE-16GB
Result = PASS

But when I submit a job without any GPUs, the system still lets me use the GPUs, which I do not expect:

srun -n 1 -p cuda --nodelist=cudanode1 ./cuda.sh
CUDA_VISIBLE_DEVICES:
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB, Device1 = Tesla P100-PCIE-16GB
Result = PASS

This way I am able to run 40 jobs on one server at the same time, all of them using the GPUs. Is this a bug, or did I miss something? With previous versions of Slurm, GPU allocation worked as I expected. I also tried with CUDA-enabled NAMD, which uses higher-level hardware access methods, and I get the same result.

Another problem I hit: when I change the GPU configuration from Gres=gpu:2 to Gres=gpu:no_consume:2, so that the cards can be shared by many jobs simultaneously, the system lets me use all the cards regardless of how many I request.

Regards,
Sefa ARSLAN
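P.S. For reference, cuda.sh is just a small wrapper around deviceQuery, roughly like the sketch below (the exact script and the deviceQuery path may differ on your system):

#!/bin/bash
# Sketch of the cuda.sh test wrapper used above:
# print the GPUs Slurm exposes to the job step, then run deviceQuery.
echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES}"
# Path is assumed; point it at wherever the CUDA 9 samples' deviceQuery was built.
./deviceQuery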