Scott,
On 1/20/24 12:10 PM, Scott Atchley wrote:
On Fri, Jan 19, 2024 at 9:40 PM Prentice Bisbal via Beowulf
<beowulf@beowulf.org> wrote:
> Yes, someone is sure to say "don't try characterizing all that stuff -
> it's your application's performance that matters!" Alas, we're a generic
> "any kind of research computing" organization, so there are thousands
> of apps across all possible domains.
<rant>
I agree with you. I've always hated the "it depends on your
application"
stock response in HPC. I think it's BS. Very few of us work in an
environment where we support only a handful of applications with very
similar characteristics. I say use standardized benchmarks that test
specific performance metrics (memory bandwidth, memory latency, etc.)
first, and then use a few applications to confirm what you're seeing
with those benchmarks. (Rough sketch of what I mean below.)
</rant>
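Here's roughly the kind of wrapper I mean, in Python. The benchmark
binaries and the mpirun invocation below are just placeholders for
whatever you standardize on (STREAM, the OSU micro-benchmarks, etc.);
treat it as a sketch, not a finished harness.

#!/usr/bin/env python3
"""Run the same battery of micro-benchmarks on every test system.

The commands below are placeholders; point them at whatever benchmark
builds you actually use.
"""
import csv
import subprocess

BENCHMARKS = {
    # name -> command line (placeholder paths, adjust to your installs)
    "stream_triad":  ["./stream"],
    "osu_latency":   ["mpirun", "-np", "2", "--map-by", "node", "osu_latency"],
    "osu_bandwidth": ["mpirun", "-np", "2", "--map-by", "node", "osu_bw"],
}

def run(name, cmd):
    """Run one benchmark and return its raw stdout (parse per tool later)."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=1800)
    return name, result.returncode, result.stdout

if __name__ == "__main__":
    with open("baseline_results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["benchmark", "returncode", "raw_output"])
        for name, cmd in BENCHMARKS.items():
            writer.writerow(run(name, cmd))

Run that on every candidate system and you have apples-to-apples
numbers before you ever touch an application.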
It does depend on the application(s). At OLCF, we have hundreds of
applications. Some pound the network and some do not. Because we are a
Leadership Computing Facility, a user cannot get any time on the
machine unless they can scale to 20% and ideally to 100% of the
system. We have several apps with FFTs which become all-to-alls in
MPI. Because of this, ideally we want a non-blocking fat-tree (i.e.,
Clos) topology. Every other topology is a compromise. That said, a
full Clos is 2x or more in cost compared to other common topologies
(e.g., dragonfly or a 2:1 oversubscribed fat-tree). If your workload
is small jobs that can fit in a rack, for example, then by all means
save some money and get an oversubscribed fat-tree, dragonfly, etc. If
your jobs need to use the full machine and they have large message
collectives, then you have to bite the bullet and spend more on
network and less on compute and/or storage.
To assess the usage of our parallel file systems, we run with Darshan
installed and it captures data from each MPI job (each job step within
a job). We do not have similar tools to determine how the network is
being used (e.g., how much bandwidth do we need, what communication
patterns). When I was at Myricom and we were releasing Myri-10G, I
benchmarked several ISV codes on 2G versus 10G. If I remember correctly,
Fluent did not benefit from the extra bandwidth, but PowerFlow benefited a lot.
My point is that "It depends" may not be a satisfying answer, but it
is realistic.
I don't disagree with you that different apps stress a cluster in
different ways. I've seen a lot of that myself. What I'm saying is that
designing a cluster around only a handful of applications is not
practical or possible for most clusters, since the same cluster will
most likely be supporting apps at different ends of the spectrum(s).
I've had numerous discussions with users who don't think IB is worth it
because if we buy Ethernet we can get more cores. That may be fine for their
embarrassingly parallel application, but what about the user with the
tightly-coupled MD application?
I always recommend going with the best networking you can afford,
because having better networking won't hurt the apps that don't need it,
but the apps that DO need it will definitely notice it when it's not there.
Like you, I have seen the cost difference in going from non-blocking to
2:1 oversubscription. Once you get beyond a couple of switches, it
becomes significantly more money to go from 2:1 to non-blocking. The
savings from going 2:1 to 3:1, though, isn't nearly as much (at least
for the cluster sizes I've spec'ed out), so it doesn't seem worth it to
go from 2:1 to 3:1. Going non-blocking within a rack and oversubscribed
between racks (like SDSC did with the Comet cluster) isn't a bad idea
if budget is an issue.
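To make that concrete, here's the back-of-the-envelope switch count I
do when comparing oversubscription ratios on a two-level (leaf/spine)
fat-tree. The node count and switch radix are example numbers only,
and it ignores real-world details like rails, directors, and cable
lengths:

import math

def two_level_fat_tree(nodes, radix, down, up):
    """Switch and cable counts for a two-level (leaf/spine) fat-tree.

    down/up = leaf ports facing nodes vs. facing spines;
    down:up is the oversubscription ratio (1:1 = non-blocking).
    """
    assert down + up <= radix
    leaves = math.ceil(nodes / down)
    uplinks = leaves * up
    spines = math.ceil(uplinks / radix)
    return {
        "oversubscription": f"{down}:{up}",
        "leaf_switches": leaves,
        "spine_switches": spines,
        "total_switches": leaves + spines,
        "cables": nodes + uplinks,  # node links plus leaf-to-spine links
    }

# Example: 1024 nodes on 64-port switches (illustrative numbers only).
for down, up in [(32, 32), (42, 21), (48, 16)]:  # 1:1, 2:1, 3:1
    print(two_level_fat_tree(1024, 64, down, up))

With those made-up numbers, most of the savings shows up going from
1:1 to 2:1; tightening further to 3:1 trims comparatively little,
which matches what I've seen on actual quotes.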
> Another interesting topic is that nodes are becoming many-core - any
> thoughts?
Core counts are getting too high to be of use in HPC. High core-count
processors sound great until you realize that all those cores are now
competing for the same memory bandwidth and network bandwidth, neither
of which increases with core count.
Last April we were evaluating test systems from different vendors for a
cluster purchase. One of our test users does a lot of CFD simulations
that are very sensitive to memory bandwidth. While he was getting a 50%
speedup on AMD compared to Intel (which makes sense, since AMDs require
12 DIMM slots to be filled instead of Intel's 8), he asked us to
consider servers with FEWER cores. Even with the AMDs, he was saturating
the memory bandwidth before scaling to all the cores, causing his
performance to plateau. Buying cheaper processors with lower core counts
was better for him, since the savings would allow us to buy additional
nodes, which would be more beneficial to him.
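The arithmetic behind that plateau is simple enough to sketch. The
socket bandwidth and per-core demand below are made-up figures for
illustration, not measurements from his code:

def cores_until_bandwidth_bound(socket_bw_gbs, per_core_demand_gbs, cores):
    """Estimate how many cores a bandwidth-bound code can keep busy before
    the socket's memory bandwidth, not the core count, is the limit."""
    usable = min(cores, int(socket_bw_gbs // per_core_demand_gbs))
    return usable, usable / cores

# Illustrative only: a 64-core socket with ~400 GB/s of memory bandwidth
# and a CFD kernel that wants ~10 GB/s per core.
usable, fraction = cores_until_bandwidth_bound(400, 10, 64)
print(f"~{usable} of 64 cores busy before the memory bus saturates "
      f"({fraction:.0%} of the cores you paid for)")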
We see this as well in DOE especially when GPUs are doing a
significant amount of the work.
Yeah, I noticed that Frontier and Aurora will actually be single-socket
systems w/ "only" 64 cores.
Scott
<snip>
--
Prentice
On 1/16/24 5:19 PM, Mark Hahn wrote:
> Hi all,
> Just wondering if any of you have numbers (or experience) with
> modern high-speed COTS ethernet.
>
> Latency mainly, but perhaps also message rate. Also ease of use
> with open-source products like OpenMPI, maybe Lustre?
> Flexibility in configuring clusters in the >= 1k node range?
>
> We have a good idea of what to expect from Infiniband offerings,
> and are familiar with scalable network topologies.
> But vendors seem to think that high-end ethernet (100-400Gb) is
> competitive...
>
> For instance, here's an excellent study of Cray/HP Slingshot (non-COTS):
> https://arxiv.org/pdf/2008.08886.pdf
> (half rtt around 2 us, but this paper has great stuff about
> congestion, etc)
>
> Yes, someone is sure to say "don't try characterizing all that stuff -
> it's your application's performance that matters!" Alas, we're a generic
> "any kind of research computing" organization, so there are thousands
> of apps across all possible domains.
>
> Another interesting topic is that nodes are becoming many-core - any
> thoughts?
>
> Alternatively, are there other places to ask? Reddit or something less
> "greybeard"?
>
> thanks, mark hahn
> McMaster U / SharcNET / ComputeOntario / DRI Alliance Canada
>
> PS: the snarky name "NVidiband" just occurred to me; too soon?
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf