Hello, I don't mind sharing the config at all. Not sure it helps, though; it's pretty basic.
Picking an example node, I have:

[ ~]$ scontrol show node arcus-htc-gpu011
NodeName=arcus-htc-gpu011 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUTot=16 CPULoad=20.43
   AvailableFeatures=cpu_gen:Haswell,cpu_sku:E5-2640v3,cpu_frq:2.60GHz,cpu_mem:64GB,gpu,gpu_mem:12GB,gpu_gen:Kepler,gpu_sku:K40,gpu_cc:3.5,
   ActiveFeatures=cpu_gen:Haswell,cpu_sku:E5-2640v3,cpu_frq:2.60GHz,cpu_mem:64GB,gpu,gpu_mem:12GB,gpu_gen:Kepler,gpu_sku:K40,gpu_cc:3.5,
   Gres=gpu:k40m:2
   NodeAddr=arcus-htc-gpu011 NodeHostName=arcus-htc-gpu011
   OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
   RealMemory=63000 AllocMem=0 FreeMem=56295 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=96 Owner=N/A MCS_label=N/A
   Partitions=htc
   BootTime=2018-11-28T15:12:29 SlurmdStartTime=2018-11-28T17:58:55
   CfgTRES=cpu=16,mem=63000M,billing=16
   AllocTRES=cpu=16
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

gres.conf on arcus-htc-gpu011 is

[ ~]$ cat /etc/slurm/gres.conf
Name=gpu Type=k40m File=/dev/nvidia0
Name=gpu Type=k40m File=/dev/nvidia1

Relevant bits of slurm.conf are, I believe:

GresTypes=hbm,gpu
(DebugFlags=Priority,Backfill,NodeFeatures,Gres,Protocol,TraceJobs)
NodeName=arcus-htc-gpu009,arcus-htc-gpu[011-018] Weight=96 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=63000 Gres=gpu:k40m:2 Feature=cpu_gen:Haswell,cpu_sku:E5-2640v3,cpu_frq:2.60GHz,cpu_mem:64GB,gpu,gpu_mem:12GB,gpu_gen:Kepler,gpu_sku:K40,gpu_cc:3.5,

I don't think I did anything else. I have other types of nodes: a couple of P100s, a couple of V100s, a couple of K80s, and one or two odd things (M40, P4). I used to run with a gres.conf that simply had 'Name=gpu File=/dev/nvidia[0-2]' (or [0-4], depending), and that also worked; I introduced the type when I gained a node that has two different NVIDIA cards, so which card was on which port became important, not because the 'range' configuration caused problems.
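For the node that mixes card types, the per-device form is what makes the port-to-type mapping explicit. A minimal sketch of what that looks like (illustrative only, since that node's actual config isn't shown above; assume one P100 on /dev/nvidia0 and one V100 on /dev/nvidia1):

# gres.conf on the hypothetical mixed node
Name=gpu Type=p100 File=/dev/nvidia0
Name=gpu Type=v100 File=/dev/nvidia1

with a matching Gres=gpu:p100:1,gpu:v100:1 on that node's NodeName line in slurm.conf.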
This wasn't a fresh install of 18.x; it was a 17.x installation that I upgraded to 18.x. Not sure if that makes a difference. I made no changes to anything (slurm.conf, gres.conf) with the update, though; I just installed the new rpms.

Tina

On 05/12/2018 13:20, Lou Nicotra wrote:
> Tina, thanks for confirming that GPU GRES resources work with 18.08... I
> might just upgrade to 18.08.3, as I am running 18.08.0.
>
> The nvidia devices exist on all servers and persistence is set. They
> have been in there for a number of years and our users make use of them
> daily. I can actually see that slurmd knows about them while restarting
> the daemon:
> [2018-12-05T08:03:35.989] Slurmd shutdown completing
> [2018-12-05T08:03:36.015] Message aggregation disabled
> [2018-12-05T08:03:36.016] gpu device number 0(/dev/nvidia0):c 195:0 rwm
> [2018-12-05T08:03:36.017] gpu device number 1(/dev/nvidia1):c 195:1 rwm
> [2018-12-05T08:03:36.059] slurmd version 18.08.0 started
> [2018-12-05T08:03:36.059] slurmd started on Wed, 05 Dec 2018 08:03:36 -0500
> [2018-12-05T08:03:36.059] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2
> Memory=386757 TmpDisk=4758 Uptime=21324804 CPUSpecList=(null)
> FeaturesAvail=(null) FeaturesActive=(null)
>
> Would you mind sharing the portions of slurm.conf and the corresponding
> GRES definitions that you are using? Do you have individual gres.conf
> files for each server based on GPU type? I tried both; neither works.
>
> My slurm.conf file has entries for GPUs as follows:
> GresTypes=gpu
> #AccountingStorageTRES=gres/gpu,gres/gpu:k20,gres/gpu:1080gtx
> (currently commented out)
>
> gres.conf is as follows (had tried different configs, no change with
> either one...)
> # GPU Definitions
> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
> #NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia[0-1] Cores=0,1
>
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
> #NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia[0-1] Cores=0,1
>
> What am I missing?
>
> Thanks...
>
> On Wed, Dec 5, 2018 at 4:59 AM Tina Friedrich
> <tina.friedr...@it.ox.ac.uk> wrote:
>
> I'm running 18.08.3, and I have a fair number of GPU GRES resources;
> recently upgraded to 18.08.3 from a 17.x release. It's definitely not
> as if they don't work in an 18.x release. (I do not distribute the
> same gres.conf file everywhere, though; never tried that.)
>
> Just a really stupid question: the /dev/nvidiaX devices do exist, I
> assume? You are running nvidia-persistenced (or something similar) to
> ensure the cards are up and the device files initialised etc.?
>
> Tina
>
> On 04/12/2018 23:36, Brian W. Johanson wrote:
> > The only thing to suggest once again is increasing the logging of
> > both slurmctld and slurmd.
> > As for downgrading, I wouldn't suggest running a 17.x slurmdbd
> > against a db built with 18.x. I imagine there are enough changes
> > there to cause trouble.
> > I don't imagine downgrading will fix your issue; if you are running
> > 18.08.0, the most recent release is 18.08.3. The NEWS file packed in
> > the tarballs gives the fixes in each version. I don't see any that
> > would fit your case.
> >
> > On 12/04/2018 02:11 PM, Lou Nicotra wrote:
> >> Brian, I used a single gres.conf file and distributed it to all
> >> nodes... Restarted all daemons; unfortunately, scontrol still does
> >> not show any Gres resources for GPU nodes...
> >>
> >> Will try to roll back to a 17.x release. Is it basically a matter of
> >> removing the 18.x rpms and installing 17's? Does the DB need to be
> >> downgraded also?
> >>
> >> Thanks...
> >> Lou
> >>
> >> On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson
> >> <bjoha...@psc.edu> wrote:
> >>
> >> Do one more pass through making sure
> >> s/1080GTX/1080gtx and s/K20/k20
> >>
> >> Shut down all slurmd and slurmctld; start slurmctld, then start slurmd.
> >>
> >> I find it less confusing to have a global gres.conf file. I
> >> haven't used a list (nvidia[0-1]), mainly because I want to specify
> >> the cores to use for each gpu.
> >>
> >> gres.conf would look something like...
> >>
> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20
> >> File=/dev/nvidia0 Cores=0
> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20
> >> File=/dev/nvidia1 Cores=1
> >> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
> >> File=/dev/nvidia0 Cores=0
> >> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
> >> File=/dev/nvidia1 Cores=1
> >>
> >> which can be distributed to all nodes.
> >>
> >> -b
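Once a corrected gres.conf is in place and the daemons have been restarted, the quickest sanity check is whether the nodes actually advertise the resource. For example (node name taken from this thread):

sinfo -o "%N %G"
scontrol show node tiger01 | grep Gres

If Gres= still comes back as (null), the definition never made it into the controller's record for that node, and a --gres request against it is unlikely to validate.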
> >>
> >> On 12/04/2018 09:55 AM, Lou Nicotra wrote:
> >>> Brian, the specific node does not show any gres...
> >>> root@panther02 slurm# scontrol show partition=tiger_1
> >>> PartitionName=tiger_1
> >>>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> >>>    AllocNodes=ALL Default=YES QoS=N/A
> >>>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
> >>>    MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
> >>>    Nodes=tiger[01-22]
> >>>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
> >>>    OverTimeLimit=NONE PreemptMode=OFF
> >>>    State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
> >>>    JobDefaults=(null)
> >>>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
> >>>
> >>> root@panther02 slurm# scontrol show node=tiger11
> >>> NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
> >>>    CPUAlloc=0 CPUTot=48 CPULoad=11.50
> >>>    AvailableFeatures=HyperThread
> >>>    ActiveFeatures=HyperThread
> >>>    Gres=(null)
> >>>    NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
> >>>    OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
> >>>    RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
> >>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >>>    Partitions=tiger_1,compute_1
> >>>    BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
> >>>    CfgTRES=cpu=48,mem=1M,billing=48
> >>>    AllocTRES=
> >>>    CapWatts=n/a
> >>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >>>
> >>> So, something is not set up correctly... Could it be an 18.X bug?
> >>>
> >>> Thanks.
> >>>
> >>> On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra
> >>> <lnico...@interactions.com> wrote:
> >>>
> >>> Thanks Michael. I will try 17.x, as I also could not see
> >>> anything wrong with my settings... Will report back afterwards...
> >>>
> >>> Lou
> >>>
> >>> On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico
> >>> <mdidomeni...@gmail.com> wrote:
> >>>
> >>> Unfortunately, someone smarter than me will have to help
> >>> further. I'm not sure I see anything specifically wrong. The one
> >>> thing I might try is backing the software down to a 17.x release
> >>> series. I recently tried 18.x and had some issues. I can't say
> >>> whether it'll be any different, but you might be exposing an
> >>> undiagnosed bug in the 18.x branch.
> >>>
> >>> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra
> >>> <lnico...@interactions.com> wrote:
> >>> >
> >>> > Made the change in the gres.conf file on the local server
> >>> and restarted slurmd and slurmctld on the master....
> >>> Unfortunately, same error...
> >>> >
> >>> > Distributed the corrected gres.conf to all k20 servers and
> >>> restarted slurmd and slurmctld... Still the same error...
> >>> >
> >>> > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson
> >>> <bjoha...@psc.edu> wrote:
> >>> >>
> >>> >> Is that a lowercase k in k20 specified in the batch
> >>> script and nodename, and an uppercase K specified in gres.conf?
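That question points at the likely crux: the type string requested with --gres has to match, character for character, the Type defined in gres.conf and the Gres advertised in slurm.conf, assuming the comparison is literal (which the question above implies). A consistent set, sketched with the names already used in this thread, would be:

# slurm.conf
GresTypes=gpu
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2

# gres.conf (note the lowercase k20, not K20)
Name=gpu Type=k20 File=/dev/nvidia[0-1] Cores=0,1

# batch script
#SBATCH --gres=gpu:k20:1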
> >>> >>
> >>> >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
> >>> >>
> >>> >> Hi All, I have recently set up a slurm cluster with my
> >>> >> servers and I'm running into an issue while submitting
> >>> >> GPU jobs. It has something to do with gres
> >>> >> configurations, but I just can't seem to figure out what
> >>> >> is wrong. Non-GPU jobs run fine.
> >>> >>
> >>> >> The error after submitting a batch job is as follows:
> >>> >> sbatch: error: Batch job submission failed: Invalid
> >>> >> Trackable RESource (TRES) specification
> >>> >>
> >>> >> My batch job is as follows:
> >>> >> #!/bin/bash
> >>> >> #SBATCH --partition=tiger_1 # partition name
> >>> >> #SBATCH --gres=gpu:k20:1
> >>> >> #SBATCH --gres-flags=enforce-binding
> >>> >> #SBATCH --time=0:20:00 # wall clock limit
> >>> >> #SBATCH --output=gpu-%J.txt
> >>> >> #SBATCH --account=lnicotra
> >>> >> module load cuda
> >>> >> python gpu1
> >>> >>
> >>> >> Here, gpu1 is a GPU test script that runs correctly
> >>> >> when invoked via python. The tiger_1 partition has servers
> >>> >> with GPUs, a mix of 1080GTX and K20, as specified in slurm.conf.
> >>> >>
> >>> >> I have defined GRES resources in the slurm.conf file:
> >>> >> # GPU GRES
> >>> >> GresTypes=gpu
> >>> >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> >>> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
> >>> >>
> >>> >> And have a local gres.conf on the servers containing GPUs...
> >>> >> lnicotra@tiger11 ~# cat /etc/slurm/gres.conf
> >>> >> # GPU Definitions
> >>> >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
> >>> >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
> >>> >>
> >>> >> and a similar one for the 1080GTX nodes:
> >>> >> # GPU Definitions
> >>> >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
> >>> >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
> >>> >>
> >>> >> The account manager seems to know about the GPUs...
> >>> >> lnicotra@tiger11 ~# sacctmgr show tres
> >>> >>     Type            Name     ID
> >>> >> -------- --------------- ------
> >>> >>      cpu                      1
> >>> >>      mem                      2
> >>> >>   energy                      3
> >>> >>     node                      4
> >>> >>  billing                      5
> >>> >>       fs            disk      6
> >>> >>     vmem                      7
> >>> >>    pages                      8
> >>> >>     gres             gpu   1001
> >>> >>     gres         gpu:k20   1002
> >>> >>     gres     gpu:1080gtx   1003
> >>> >>
> >>> >> Can anyone point out what I am missing?
> >>> >>
> >>> >> Thanks!
> >>> >> Lou
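Following up on Brian's earlier point about logging: with the Gres debug flag enabled (as in Tina's DebugFlags line above), slurmctld should log the GRES table it builds for each node, so a K20-versus-k20 mismatch ought to surface directly in the log. The flags can also be toggled at runtime rather than via slurm.conf; for example:

scontrol setdebugflags +gres     # add the Gres debug flag on the controller
scontrol setdebug debug2         # raise slurmctld verbosity; 'scontrol setdebug info' restores it

The persistent equivalents in slurm.conf are DebugFlags=Gres and SlurmctldDebug=debug2.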