Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Brian W. Johanson
The only thing to suggest, once again, is increasing the logging of both slurmctld and slurmd. As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a db built with 18.x. I imagine there are enough changes there to cause trouble. I don't imagine downgrading will fix your issue, if you

[slurm-users] Federation and bursting to cloud

2018-12-04 Thread Sajesh Singh
We are currently investigating a switch to SLURM from PBS and I have a question on the interoperability of two features. Would the implementation of federation in SLURM affect any of the clusters' ability to burst to the cloud? -SS-

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Lou Nicotra
Brian, I used a single gres.conf file and distributed to all nodes... Restarted all daemons, unfortunately scontrol still does not show any Gres resources for GPU nodes... Will try to roll back to 17.X release. Is it basically a matter of removing 18.x rpms and installing 17's? Does the DB need to

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-12-04 Thread Christopher Benjamin Coffey
Interesting! I'll have a look - thanks! — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 11/30/18, 1:41 AM, "slurm-users on behalf of John Hearns" wrote: Chris, I have delved deep into the OOM killer code and interaction with cpusets in the p

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Brian W. Johanson
Do one more pass through, making sure s/1080GTX/1080gtx and s/K20/k20. Shut down all slurmd and slurmctld, start slurmctld, then start slurmd. I find it less confusing to have a global gres.conf file. I haven't used a list (nvidia[0-1]), mainly because I want to specify the cores to use for each gpu.
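The two substitutions above could be applied to a gres.conf in one pass with something like the following. This is only a sketch, assuming the goal is simply that the Type= names are lowercase and match whatever slurm.conf uses. The function name and the sample lines (node names, device paths) are hypothetical and not taken from the poster's actual files.

```python
import re

# Matches a Type=<value> field in a gres.conf line; the value runs to the next whitespace.
TYPE_FIELD = re.compile(r"\b(Type=)(\S+)", re.IGNORECASE)


def lowercase_gres_types(conf_text: str) -> str:
    """Return conf_text with every Type= value lowercased; other fields are untouched."""
    return TYPE_FIELD.sub(lambda m: m.group(1) + m.group(2).lower(), conf_text)


# Hypothetical gres.conf lines, for illustration only.
sample = (
    "NodeName=tiger01 Name=gpu Type=1080GTX File=/dev/nvidia0\n"
    "NodeName=tiger02 Name=gpu Type=K20 File=/dev/nvidia0\n"
)

fixed = lowercase_gres_types(sample)
print(fixed)
```

This only touches Type= fields in gres.conf. The Gres= entries on the node lines in slurm.conf would need the same spelling, and the daemons restarted in the order given above.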

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Lou Nicotra
Brian, the specific node does not show any gres... root@panther02 slurm# scontrol show partition=tiger_1 PartitionName=tiger_1 AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO Max

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Lou Nicotra
Thanks Michael. I will try 17.x as I also could not see anything wrong with my settings... Will report back afterwards... Lou On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico wrote: > unfortunately, someone smarter than me will have to help further. I'm > not sure i see anything specifically

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Brian W. Johanson
As Michael had suggested earlier, debugflags=gpu will give you detailed output of the gres being reported by the nodes.  This would be in the slurmctld log. Or, show us the output of 'scontrol show node=tiger[01-02]' and 'scontrol show partition=tiger_1' From your previous message, that should
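When checking 'scontrol show node' output across several nodes, pulling out just the Gres= field per node can make a missing GPU entry easier to spot. A rough sketch, assuming the usual key=value layout of that output, where a node with no generic resources shows Gres=(null). The function name and the sample text are made up for illustration and are not output from the poster's cluster.

```python
def gres_by_node(scontrol_output: str) -> dict:
    """Map each NodeName to its Gres= value from `scontrol show node` text."""
    result = {}
    current = None
    for token in scontrol_output.split():
        if token.startswith("NodeName="):
            current = token.split("=", 1)[1]
        elif token.startswith("Gres=") and current is not None:
            result[current] = token.split("=", 1)[1]
    return result


# Hypothetical, abbreviated output for two nodes.
sample = """NodeName=tiger01 Arch=x86_64 CoresPerSocket=12
   Gres=(null)
   State=IDLE Partitions=tiger_1
NodeName=tiger02 Arch=x86_64 CoresPerSocket=12
   Gres=gpu:1080gtx:2
   State=IDLE Partitions=tiger_1
"""

gres = gres_by_node(sample)
missing = [node for node, value in gres.items() if value == "(null)"]
print(gres)
print("nodes without gres:", missing)
```

The sample could be replaced with the captured text of the real command, for example via subprocess.run(["scontrol", "show", "node"], capture_output=True, text=True).stdout.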

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Michael Di Domenico
unfortunately, someone smarter than me will have to help further. I'm not sure i see anything specifically wrong. The one thing i might try is backing the software down to a 17.x release series. I recently tried 18.x and had some issues. I can't say whether it'll be any different, but you might