On 01/17/2019 06:36 PM, Ryan Novosielski wrote:
I don’t actually know the answer to this one, but we have it provisioned to all
nodes.
Note that if you care about node weights (e.g. NodeName=whatever001 Weight=2,
etc. in slurm.conf), using the topology function will disable them. In a
conversation with SchedMD, I believe I was promised that a warning about this
would be added in the future.
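To make that concrete, here is a minimal sketch of the kind of weighting I
mean (node names, counts, and weight values are just placeholders, not from
anyone's real config):

    # Hypothetical slurm.conf fragment: lower Weight = preferred for allocation
    NodeName=new[001-016] CPUs=32 RealMemory=256000 Weight=1
    NodeName=old[001-032] CPUs=16 RealMemory=64000  Weight=10

All else being equal, Slurm fills the Weight=1 nodes first; per the note
above, that ordering is apparently lost once the topology plugin is in use.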
Well, that's going to be a big problem for me. One of my goals in overhauling
our Slurm config is to take advantage of the node weighting function to
prioritize certain hardware over other hardware in our very heterogeneous
cluster.
I may have to provide a larger description of my hardware/situation to
the list and ask for suggestions on how to best handle the problem.
Prentice
On Jan 17, 2019, at 4:52 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
And a follow-up question: Does topology.conf need to be on all the nodes, or
just the slurm controller? It's not clear from that web page. I would assume
only the controller needs it.
Prentice
On 1/17/19 4:49 PM, Prentice Bisbal wrote:
From https://slurm.schedmd.com/topology.html:
Note that compute nodes on switches that lack a common parent switch can be used, but no
job will span leaf switches without a common parent (unless the
TopologyParam=TopoOptional option is used). For example, it is legal to remove the line
"SwitchName=s4 Switches=s[0-3]" from the above topology.conf file. In that
case, no job will span more than four compute nodes on any single leaf switch. This
configuration can be useful if one wants to schedule multiple physical clusters as a
single logical cluster under the control of a single slurmctld daemon.
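If I'm reading that right, the passage is describing a topology.conf along
these lines (switch and node names here are made up for illustration):

    # Hypothetical topology.conf: four leaf switches with no common parent
    SwitchName=s0 Nodes=tux[0-3]
    SwitchName=s1 Nodes=tux[4-7]
    SwitchName=s2 Nodes=tux[8-11]
    SwitchName=s3 Nodes=tux[12-15]
    # With no "SwitchName=s4 Switches=s[0-3]" line tying them together,
    # no job will span nodes attached to different leaf switches.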
My current environment falls into the category of multiple physical clusters
being treated as a single logical cluster under the control of a single
slurmctld daemon. At least, that's my goal.
In my environment, I have 2 "clusters" connected by their own separate IB fabrics, and
one "cluster" connected with 10 GbE. I have a fourth cluster connected with only 1 GbE.
For this 4th cluster, we don't want jobs to span nodes, due to the slow performance of 1 GbE.
(This cluster is intended for serial and low-core-count parallel jobs.) If I just leave
those nodes out of the topology.conf file, will that have the desired effect of not
allocating multi-node jobs to those nodes, or will it result in an error of some sort?
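In other words, something like this is what I'm picturing (cluster, switch,
and node names are placeholders), with the 1 GbE nodes being the open
question:

    # Hypothetical topology.conf for the combined system
    SwitchName=ib-a  Nodes=clusterA[001-064]   # IB fabric #1
    SwitchName=ib-b  Nodes=clusterB[001-064]   # IB fabric #2
    SwitchName=tenge Nodes=clusterC[001-032]   # 10 GbE cluster
    # No top-level switch joins these, so jobs won't span fabrics.
    # Open question: omit the 1 GbE nodes from this file entirely, or list
    # them as their own leaf switch and limit them another way, e.g. a
    # dedicated partition in slurm.conf with MaxNodes=1?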