Hi, I have an issue with GPU requests in job submission. I have a single compute node (128 cores, 3 GPUs) which also runs the Slurm controller.
When I submit a job requesting the specific GPU type corresponding to the GTX 1080 (GPU id 2 on my machine), the job is not assigned to the requested GPU (cf. Example 1 below). However, if no other GPU is available, a job can be assigned to that GPU (cf. Example 2).

Example 1: with no resources in use, requesting the specific GPU type "gtx1080" yields an RTX 2080 instead (not working):

```
srun --gpus=gtx1080:1 --pty bash
$ nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-a80xxxxxxxxxx)
```

Example 2: filling the GPUs in ascending order, the first job gets GPUs 0 and 1 (the two RTX 2080s) and the second job gets GPU 2 (working as expected):

```
# terminal 1
srun --gpus=2 --pty bash
$ nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-a80xxxxxxxxxx)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-d63xxxxxxxxxx)

# terminal 2
srun --gpus=1 --pty bash
$ nvidia-smi -L
GPU 0: GeForce GTX 1080 (UUID: GPU-f58xxxxxxxxxx)
```

I use Slurm 19.05, installed from the `slurm-llnl` AUR package, on an Arch Linux machine (system info below).

slurm.conf (relevant excerpt; the full file is attached below):

```
# COMPUTE NODES
GresTypes=gpu
NodeName=XXXX NodeAddr=XXXX Gres=gpu:rtx2080:2,gpu:gtx1080:1 Sockets=4 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=376000 MemSpecLimit=10000 State=UNKNOWN
PartitionName=prod Nodes=XXXX OverSubscribe=YES Default=YES MaxTime=INFINITE DefaultTime=2:0:0 State=UP
```

gres.conf (complete file):

```
NodeName=XXXX Name=gpu Type=rtx2080 File=/dev/nvidia0 Cores=32-63
NodeName=XXXX Name=gpu Type=rtx2080 File=/dev/nvidia1 Cores=64-95
NodeName=XXXX Name=gpu Type=gtx1080 File=/dev/nvidia2 Cores=96-127
```

System info:

```
$ uname -a
Linux XXXX 5.1.15-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 25 04:49:39 UTC 2019 x86_64 GNU/Linux
```

Thanks in advance.

Best regards,
Ghislain Durif
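P.S. In case it helps with triage, here is a minimal sketch of further cross-checks (commands only, no output shown; `XXXX` stands for the placeholder node name used above):

```
# Ask the controller what the node offers; the Gres= field should list
# gpu:rtx2080:2 and gpu:gtx1080:1
scontrol show node XXXX | grep -i gres

# Make the same request through the older --gres syntax, to see whether
# only the --gpus option introduced in 19.05 is affected
srun --gres=gpu:gtx1080:1 --pty bash
$ echo $CUDA_VISIBLE_DEVICES   # device index/indices Slurm bound to the job
$ nvidia-smi -L                # should list only the GTX 1080
```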
Attachment: slurm.conf (full file)

```
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=YYYY
ControlMachine=XXXX
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
SelectType=select/cons_tres
FastSchedule=1
SelectTypeParameters=CR_CPU_Memory
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
PriorityType=priority/multifactor
PriorityFlags=CALCULATE_RUNNING,SMALL_RELATIVE_TO_TIME
PriorityFavorSmall=yes
DefMemPerCPU=2000
MaxMemPerCPU=2800
DefMemPerGPU=80000
DefCpuPerGPU=32
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
GresTypes=gpu
NodeName=XXXX NodeAddr=XXXX Gres=gpu:rtx2080:2,gpu:gtx1080:1 Sockets=4 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=376000 MemSpecLimit=10000 State=UNKNOWN
PartitionName=prod Nodes=XXXX OverSubscribe=YES Default=YES MaxTime=INFINITE DefaultTime=2:0:0 State=UP
```
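For reference, the GRES that slurmd itself parsed from gres.conf can also be checked directly on the node (a sketch; typically run as root on `XXXX`):

```
# Print the GRES configuration as parsed by slurmd, then exit.
# It should report two gpu devices of type rtx2080 (/dev/nvidia0-1)
# and one of type gtx1080 (/dev/nvidia2), with the Cores= bindings above.
slurmd -G
```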