The only thing to suggest, once again, is increasing the logging of both slurmctld
and slurmd.
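Something along these lines, if it helps (just a sketch; the exact levels are your call),
either persistently in slurm.conf or bumped on the fly for the controller:

    # slurm.conf (then scontrol reconfigure, or restart the daemons)
    SlurmctldDebug=debug2
    SlurmdDebug=debug2

    # or raise the controller's level at runtime
    scontrol setdebug debug2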
As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a db
built with 18.x. I imagine there are enough changes there to cause trouble.
I don't imagine downgrading will fix your issue, if you
We are currently investigating a switch to SLURM from PBS and I have a question
on the interoperability of two features. Would the implementation of federation
in SLURM affect any of the clusters' ability to burst to the cloud?
-SS-
Brian, I used a single gres.conf file and distributed it to all nodes...
Restarted all daemons; unfortunately, scontrol still does not show any Gres
resources for the GPU nodes...
Will try to roll back to the 17.x release. Is it basically a matter of removing
the 18.x RPMs and installing the 17.x ones? Does the DB need to
Interesting! I'll have a look - thanks!
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 11/30/18, 1:41 AM, "slurm-users on behalf of John Hearns"
wrote:
Chris, I have delved deep into the OOM killer code and its interaction with
cpusets in the p
Do one more pass through, making sure s/1080GTX/1080gtx and s/K20/k20, then:
shut down all slurmd and slurmctld, start slurmctld, start slurmd
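Roughly like this, assuming systemd units and that the Type= strings live in
gres.conf and slurm.conf under /etc/slurm (adjust paths and hostnames to your setup):

    # lowercase the GPU type names wherever they appear
    sed -i 's/1080GTX/1080gtx/g; s/K20/k20/g' /etc/slurm/gres.conf /etc/slurm/slurm.conf

    # on every compute node
    systemctl stop slurmd
    # on the controller
    systemctl stop slurmctld
    systemctl start slurmctld
    # then bring slurmd back up on the nodes
    systemctl start slurmd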
I find it less confusing to have a global gres.conf file. I haven't used a list
(nvidia[0-1]), mainly because I want to specify the cores to use for each GPU.
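For illustration, mine looks roughly like this (the node names, types, and core
ranges below are placeholders, not my actual layout):

    # gres.conf, identical copy on every node; NodeName= scopes each line
    NodeName=tiger01 Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0-9
    NodeName=tiger01 Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=10-19
    NodeName=tiger02 Name=gpu Type=k20 File=/dev/nvidia0 Cores=0-9

    # and the matching bits in slurm.conf
    GresTypes=gpu
    NodeName=tiger01 Gres=gpu:1080gtx:2 ...
    NodeName=tiger02 Gres=gpu:k20:1 ...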
Brian, the specific node does not show any gres...
root@panther02 slurm# scontrol show partition=tiger_1
PartitionName=tiger_1
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
Hidden=NO
Max
Thanks Michael. I will try 17.x as I also could not see anything wrong with
my settings... Will report back afterwards...
Lou
On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico
wrote:
> Unfortunately, someone smarter than me will have to help further. I'm
> not sure I see anything specifically
As Michael had suggested earlier, debugflags=gpu will give you detailed output
of the gres being reported by the nodes. This would be in the slurmctld log.
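If it's easier, that flag can be flipped on at runtime rather than by editing
slurm.conf (a sketch; I use the Gres flag name here, use whatever your version accepts):

    scontrol setdebugflags +Gres
    # ...restart slurmd on a GPU node or reproduce the issue, then check the slurmctld log
    scontrol setdebugflags -Gres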
Or, show us the output of 'scontrol show node=tiger[01-02]' and 'scontrol show
partition=tiger_1'
From your previous message, that should
Unfortunately, someone smarter than me will have to help further. I'm
not sure I see anything specifically wrong. The one thing I might try
is backing the software down to a 17.x release series. I recently
tried 18.x and had some issues. I can't say whether it'll be any
different, but you might