We managed to resolve this as follows.

gres.conf changes:
-NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
-NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
+NodeName=boole-n024 Name=gpu Type=rtx2080ti File=/dev/nvidia0
+NodeName=boole-n024 Name=gpu Type=rtx2080ti File=/dev/nvidia1

slurm.conf changes:

-PartitionName=long Nodes=boole-n[001-006],boole-n017,boole-n[018-020],boole-n[025] MaxTime=10-00:00:00 State=UP Sha
+PartitionName=long Nodes=boole-n[001-006],boole-n[016,017],boole-n[018-020],boole-n[025] MaxTime=10-00:00:00 State=

So it seems the problem was that the gres type name started with a number instead of a letter.
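In case it helps anyone else who hits this, here is a rough sketch of how the fix can be verified. Two caveats: the systemctl unit names assume a systemd-managed install, and restarting both daemons after editing gres.conf is a cautious assumption on my part rather than something I can promise is strictly required.

# restart the controller, then slurmd on the affected node
root@boole01:/etc/slurm # systemctl restart slurmctld
[root@boole-n024:/etc/slurm]# systemctl restart slurmd

# confirm the controller now reports the renamed gres type
root@boole01:/etc/slurm # scontrol show node boole-n024 | grep -i gres

# confirm an allocation against the new type name is accepted
root@boole01:/etc/slurm # salloc --gres=gpu:rtx2080ti:1 -N 1

With a valid type name, the salloc should be granted or queued rather than rejected immediately with the invalid gres error.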
Thanks,

Sean

On Fri, Jan 11, 2019 at 04:51:38PM +0000, Sean McGrath wrote:

> I forgot to mention before: we are running slurm version 18.08.3.
>
> On Fri, Jan 11, 2019 at 10:35:09AM -0500, Paul Edmon wrote:
>
> > I'm pretty sure that gres.conf has to be on all the nodes as well
> > and not just the master.
>
> Thanks Paul. We deploy the same slurm configuration, including the
> gres.conf file, cluster wide. I've double checked the node in question
> and it has the correct gres.conf.
>
> Best
>
> Sean
>
> > -Paul Edmon-
> >
> > On 1/11/19 5:21 AM, Sean McGrath wrote:
> > > Hi everyone,
> > >
> > > Your help with this would be much appreciated, please.
> > >
> > > We have a cluster with 3 types of gpu configured in gres. Users can
> > > successfully request 2 of the gpu types, but the third errors when
> > > requested.
> > >
> > > Here is the successful salloc behaviour:
> > >
> > > root@boole01:/etc/slurm # salloc --gres=gpu:tesla:1 -N 1
> > > salloc: Granted job allocation 271558
> > > [root@boole-n019:/etc/slurm]# exit
> > > salloc: Relinquishing job allocation 271558
> > > root@boole01:/etc/slurm # salloc --gres=gpu:volta:1 -N 1
> > > salloc: Pending job allocation 271559
> > > salloc: job 271559 queued and waiting for resources
> > > ^Csalloc: Job allocation 271559 has been revoked.
> > >
> > > And the unsuccessful salloc behaviour:
> > >
> > > root@boole01:/etc/slurm # salloc --gres=gpu:2080ti:1 -N 1
> > > salloc: error: Job submit/allocate failed: Invalid generic resource
> > > (gres) specification
> > >
> > > Slurm.log output for successful sallocs:
> > >
> > > [2019-01-11T10:13:36.434] sched: _slurm_rpc_allocate_resources JobId=271558 NodeList=boole-n019 usec=30495
> > > [2019-01-11T10:13:42.485] _job_complete: JobId=271558 WEXITSTATUS 0
> > > [2019-01-11T10:13:42.486] _job_complete: JobId=271558 done
> > > [2019-01-11T10:13:46.000] sched: _slurm_rpc_allocate_resources JobId=271559 NodeList=(null) usec=15674
> > > [2019-01-11T10:13:48.778] _job_complete: JobId=271559 WTERMSIG 126
> > > [2019-01-11T10:13:48.778] _job_complete: JobId=271559 cancelled by interactive user
> > > [2019-01-11T10:13:48.778] _job_complete: JobId=271559 done
> > >
> > > Slurm.log output for unsuccessful sallocs:
> > >
> > > [2019-01-11T10:13:55.755] _get_next_job_gres: Invalid GRES job specification gpu:2080ti:1
> > > [2019-01-11T10:13:55.755] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification
> > >
> > > Slurm gres configuration:
> > >
> > > root@boole01:/etc/slurm # grep -i gres slurm.conf | grep -v ^#
> > > GresTypes=gpu,mic
> > > NodeName=boole-n[018-023] Gres=gpu:tesla:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=50
> > > NodeName=boole-n024 Gres=gpu:2080ti:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=100
> > > NodeName=boole-n016 Gres=gpu:volta:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=200
> > >
> > > gres.conf:
> > >
> > > root@boole01:/etc/slurm # cat gres.conf
> > > NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
> > > NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
> > > NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
> > > NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
> > > NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
> > > NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
> > > #NodeName=boole-n017 Name=mic File=/dev/mic0
> > > #NodeName=boole-n017 Name=mic File=/dev/mic1
> > >
> > > Please let me know if there is any more info that would be helpful for this.
> > >
> > > What am I missing or doing wrong?
> > >
> > > Many thanks in advance.
> > >
> > > Sean

--

Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin

sean.mcgr...@tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725