On 01/17/2019 07:55 PM, Fulcomer, Samuel wrote:
We use topology.conf to segregate architectures (Sandy Bridge through Skylake), and also to isolate individual nodes that have 1Gb/s Ethernet rather than IB (older GPU nodes with deprecated IB cards). In the latter case, topology.conf has a switch entry for each such node.
So Slurm thinks each node has its own switch that is not shared with any other node?

It used to be the case that Slurm was unhappy about nodes that were defined in slurm.conf but did not appear in topology.conf. This may have changed...
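
For illustration, that kind of per-node isolation might look roughly like the following topology.conf fragment (the switch and node names here are made up):

  # IB-connected nodes hang off a real leaf switch
  SwitchName=ibleaf1 Nodes=node[001-032]
  # Each 1GbE-only node gets its own single-node "switch", so the
  # scheduler will never place a multi-node job across them
  SwitchName=eth-gpu01 Nodes=gpu01
  SwitchName=eth-gpu02 Nodes=gpu02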

On Thu, Jan 17, 2019 at 6:37 PM Ryan Novosielski <novos...@rutgers.edu> wrote:

    I don’t actually know the answer to this one, but we have it
    provisioned to all nodes.

    Note that if you care about node weights (e.g. NodeName=whatever001
    Weight=2, etc. in slurm.conf), using the topology function will
    disable them. In a conversation with SchedMD, I believe I was
    promised that a warning about this would be added in the future.
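
    As a rough sketch (node names are hypothetical), weights are set per
    node in slurm.conf, and Slurm prefers lower-weight nodes when
    allocating jobs:

        NodeName=whatever[001-016] CPUs=32 Weight=1
        NodeName=whatever[017-032] CPUs=32 Weight=2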

    > On Jan 17, 2019, at 4:52 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
    >
    > And a follow-up question: Does topology.conf need to be on all
    the nodes, or just the slurm controller? It's not clear from that
    web page. I would assume only the controller needs it.
    >
    > Prentice
    >
    > On 1/17/19 4:49 PM, Prentice Bisbal wrote:
    >> From https://slurm.schedmd.com/topology.html:
    >>
    >>> Note that compute nodes on switches that lack a common parent
    switch can be used, but no job will span leaf switches without a
    common parent (unless the TopologyParam=TopoOptional option is
    used). For example, it is legal to remove the line "SwitchName=s4
    Switches=s[0-3]" from the above topology.conf file. In that case,
    no job will span more than four compute nodes on any single leaf
    switch. This configuration can be useful if one wants to schedule
    multiple physical clusters as a single logical cluster under the
    control of a single slurmctld daemon.
    >>
    >> My current environment falls into the category of multiple
    physical clusters being treated as a single logical cluster under
    the control of a single slurmctld daemon. At least, that's my goal.
    >>
    >> In my environment, I have 2 "clusters" connected by their own
    separate IB fabrics, and one "cluster" connected with 10 GbE. I
    have a fourth cluster connected with only 1GbE. For this 4th
    cluster, we don't want jobs to span nodes, due to the slow
    performance of 1 GbE. (This cluster is intended for serial and
    low-core-count parallel jobs.) If I just leave those nodes out of
    the topology.conf file, will that have the desired effect of not
    allocating multi-node jobs to those nodes, or will it result in an
    error of some sort?
    >>
    >
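
A rough sketch of a topology.conf for the layout described above, with hypothetical switch and node names: each fabric gets its own leaf switch with no common parent, and each 1GbE node gets a single-node "switch" of its own so multi-node jobs are never scheduled there.

  # IB cluster 1
  SwitchName=ib1 Nodes=ib1-[001-064]
  # IB cluster 2
  SwitchName=ib2 Nodes=ib2-[001-064]
  # 10GbE cluster
  SwitchName=eth10g Nodes=teng-[001-032]
  # 1GbE cluster: one single-node "switch" per node
  SwitchName=geth-001 Nodes=geth-001
  SwitchName=geth-002 Nodes=geth-002
  # (and so on for the rest of the 1GbE nodes)

With no top-level "SwitchName=top Switches=..." line, no job can span two of these leaf switches, per the documentation quoted above.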

