Re: [slurm-users] GRES GPU issues

2018-12-06 Thread Tina Friedrich
The available features / constraints aren't necessary; their purpose is to offer a slightly more flexible way to request resources (esp. GPU). As in, quite often people don't specifically need a P100 or V100, but they can't run on a Kepler card; with the '--gres=gpu:p100:X' syntax they can (I b
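A rough sketch of the two request styles described here, assuming the site has defined GPU-generation node features (the feature names below are illustrative, not from this thread):
    # ask for a specific GPU type via GRES
    sbatch --gres=gpu:p100:1 job.sh
    # or ask for any GPU and rule out older generations via a constraint
    sbatch --gres=gpu:1 --constraint="gpu_gen:Pascal|gpu_gen:Volta" job.sh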

Re: [slurm-users] GRES GPU issues

2018-12-05 Thread Lou Nicotra
OK, after looking at your configs, I noticed that I was missing a "Gres=gpu" entry on my NodeName definition. Added and distributed... NodeName=tiger11 NodeAddr=X.X.X.X Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=gpu:1080gtx:0,gpu:k20:1 Feature=HyperThread State=UNKNOWN Assuming that 0 and 1
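One note on the syntax: the trailing number in a slurm.conf Gres= entry is a count, not a device index, so a node holding a single K20 and no 1080GTX would usually be written as in this sketch (other parameters copied from the line above, device path is a placeholder):
    # slurm.conf
    NodeName=tiger11 NodeAddr=X.X.X.X Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=gpu:k20:1 Feature=HyperThread State=UNKNOWN
    # matching gres.conf on that node
    Name=gpu Type=k20 File=/dev/nvidia0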

Re: [slurm-users] GRES GPU issues

2018-12-05 Thread Tina Friedrich
Hello, don't mind sharing the config at all. Not sure it helps though, it's pretty basic. Picking an example node, I have [ ~]$ scontrol show node arcus-htc-gpu011 NodeName=arcus-htc-gpu011 Arch=x86_64 CoresPerSocket=8 CPUAlloc=16 CPUTot=16 CPULoad=20.43 AvailableFeatures=cpu_gen:Haswell,

Re: [slurm-users] GRES GPU issues

2018-12-05 Thread Lou Nicotra
Tina, thanks for confirming that GPU GRES resources work with 18.08... I might just upgrade to 18.08.03 as I am running 18.08.0. The nvidia devices exist on all servers and persistence is set. They have been in there for a number of years and our users make use of them daily. I can actually see th
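A quick way to double-check what is described here, outside of Slurm entirely (plain shell, nothing thread-specific):
    ls -l /dev/nvidia*
    nvidia-smi --query-gpu=name,persistence_mode --format=csv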

Re: [slurm-users] GRES GPU issues

2018-12-05 Thread Tina Friedrich
I'm running 18.08.3, and I have a fair number of GPU GRES resources - recently upgraded to 18.08.3 from a 17.x release. It's definitely not as if they don't work in an 18.x release. (I do not distribute the same gres.conf file everywhere though, never tried that.) Just a really stupid question

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Brian W. Johanson
Only thing to suggest once again is increasing the logging of both slurmctld and slurmd. As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a db built with 18.x. I imagine there are enough changes there to cause trouble. I don't imagine downgrading will fix your issue, if you

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Lou Nicotra
Brian, I used a single gres.conf file and distributed it to all nodes... Restarted all daemons; unfortunately, scontrol still does not show any Gres resources for the GPU nodes... Will try to roll back to a 17.x release. Is it basically a matter of removing the 18.x RPMs and installing 17's? Does the DB need to

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Brian W. Johanson
Do one more pass through making sure s/1080GTX/1080gtx and s/K20/k20; shut down all slurmd and slurmctld, start slurmctld, then start slurmd. I find it less confusing to have a global gres.conf file. I haven't used a list (nvidia[0-1]), mainly because I want to specify the cores to use for each gpu.
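A sketch of the kind of global gres.conf this suggests, with explicit device files and per-GPU core bindings; node names, device paths, and core ranges are placeholders:
    # gres.conf shared by every node; the NodeName= prefix scopes each line
    NodeName=tiger11 Name=gpu Type=k20     File=/dev/nvidia0 Cores=0-11
    NodeName=tiger12 Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=12-23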

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Lou Nicotra
Brian, the specific node does not show any gres... root@panther02 slurm# scontrol show partition=tiger_1 PartitionName=tiger_1 AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO Max

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Lou Nicotra
Thanks Michael. I will try 17.x as I also could not see anything wrong with my settings... Will report back afterwards... Lou On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico wrote: > unfortunately, someone smarter than me will have to help further. I'm > not sure I see anything specifically

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Brian W. Johanson
As Michael had suggested earlier, debugflags=gres will give you detailed output of the gres being reported by the nodes. This would be in the slurmctld log. Or, show us the output of 'scontrol show node=tiger[01-02]' and 'scontrol show partition=tiger_1'. From your previous message, that should
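For reference, a sketch of turning that flag on without a restart and pulling the requested output (node and partition names as used earlier in the thread):
    scontrol setdebugflags +gres      # controller-side, logged to the slurmctld log
    scontrol show node tiger11
    scontrol show partition tiger_1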

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Michael Di Domenico
Unfortunately, someone smarter than me will have to help further. I'm not sure I see anything specifically wrong. The one thing I might try is backing the software down to a 17.x release series. I recently tried 18.x and had some issues. I can't say whether it'll be any different, but you might

Re: [slurm-users] GRES GPU issues

2018-12-03 Thread Lou Nicotra
Made the change in the gres.conf file on the local server and restarted slurmd and slurmctld on the master. Unfortunately, same error... Distributed the corrected gres.conf to all k20 servers, restarted slurmd and slurmctld... Still the same error... On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson wrot

Re: [slurm-users] GRES GPU issues

2018-12-03 Thread Brian W. Johanson
Is that a lowercase k in k20 specified in the batch script and NodeName, and an uppercase K specified in gres.conf? On 12/03/2018 09:13 AM, Lou Nicotra wrote: Hi All, I have recently set up a slurm cluster with my servers and I'm running into an issue while submitting GPU jobs. It has something t
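The GRES type string is matched literally, so the same spelling has to appear everywhere; a minimal sketch with lowercase k20 throughout (other fields elided):
    # slurm.conf
    NodeName=tiger11 ... Gres=gpu:k20:1
    # gres.conf
    Name=gpu Type=k20 File=/dev/nvidia0
    # batch script
    #SBATCH --gres=gpu:k20:1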

Re: [slurm-users] GRES GPU issues

2018-12-03 Thread Lou Nicotra
Here you go... Thanks for looking into this... lnicotra@tiger11 run# scontrol show config Configuration data as of 2018-12-03T15:39:51 AccountingStorageBackupHost = (null) AccountingStorageEnforce = none AccountingStorageHost = panther02 AccountingStorageLoc= N/A AccountingStoragePort = 681

Re: [slurm-users] GRES GPU issues

2018-12-03 Thread Michael Di Domenico
Are you willing to paste an `scontrol show config` from the machine having trouble? On Mon, Dec 3, 2018 at 12:10 PM Lou Nicotra wrote: > > I'm running slurmd version 18.08.0... > > It seems that the system recognizes the GPUs after a slurmd restart. I tuned > debug to 5, restarted and then submit

Re: [slurm-users] GRES GPU issues

2018-12-03 Thread Lou Nicotra
I'm running slurmd version 18.08.0... It seems that the system recognizes the GPUs after a slurmd restart. I tuned debug to 5, restarted and then submitted a job. Nothing gets logged to the log file on the local server... [2018-12-03T11:55:18.442] Slurmd shutdown completing [2018-12-03T11:55:18.484] debug:
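If nothing reaches the node's log even at a higher debug level, one option (not from the thread, just a common check) is to stop the slurmd service and run it in the foreground with extra verbosity, so the GRES discovery messages print straight to the terminal:
    slurmd -D -vvvv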

Re: [slurm-users] GRES GPU issues

2018-12-03 Thread Michael Di Domenico
Do you get anything additional in the slurm logs? Have you tried adding gres to the debugflags? What version of slurm are you running? On Mon, Dec 3, 2018 at 9:18 AM Lou Nicotra wrote: > > Hi All, I have recently set up a slurm cluster with my servers and I'm > running into an issue while submi

[slurm-users] GRES GPU issues

2018-12-03 Thread Lou Nicotra
Hi All, I have recently set up a slurm cluster with my servers and I'm running into an issue while submitting GPU jobs. It has something to do with the gres configurations, but I just can't seem to figure out what is wrong. Non-GPU jobs run fine. The error is as follows: sbatch: error: Batch job submi
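For context, a GPU job of the sort this thread concerns would typically be submitted with a script along these lines (partition and GRES type taken from later messages; contents otherwise illustrative):
    #!/bin/bash
    #SBATCH --partition=tiger_1
    #SBATCH --gres=gpu:k20:1
    srun nvidia-smi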