Re: [Beowulf] [External] anyone have modern interconnect metrics?

Prentice Bisbal via Beowulf Mon, 22 Jan 2024 08:54:54 -0800


On 1/22/24 11:38 AM, Scott Atchley wrote:

On Mon, Jan 22, 2024 at 11:16 AM Prentice Bisbal <pbis...@pppl.gov> wrote:

    <snip>

        > Another interesting topic is that nodes are becoming many-core - any
        > thoughts?

        Core counts are getting too high to be of use in HPC. High core-count
        processors sound great until you realize that all those cores are now
        competing for the same memory bandwidth and network bandwidth, neither
        of which increases with core count.

        Last April we were evaluating test systems from different vendors for
        a cluster purchase. One of our test users does a lot of CFD
        simulations that are very sensitive to memory bandwidth. While he was
        getting a 50% speedup on AMD compared to Intel (which makes sense,
        since the AMDs require 12 DIMM slots to be filled instead of Intel's
        8), he asked us to consider servers with FEWER cores. Even with the
        AMDs, he was saturating the memory bandwidth before scaling to all
        the cores, causing his performance to plateau. Buying cheaper
        processors with lower core counts was better for him, since the
        savings would allow us to buy additional nodes, which would benefit
        him more.
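
The plateau described above can be illustrated with a toy model. This is only a sketch: it assumes a purely bandwidth-bound code in which every core demands the same memory bandwidth, and the per-core demand and node bandwidth figures are hypothetical placeholders, not measurements from this thread. The 12-versus-8 ratio comes from the DIMM slot counts mentioned above and assumes equal per-channel speed on both platforms.

```python
import math


def saturation_cores(per_core_demand_gbs, node_bw_gbs):
    """Smallest core count whose combined demand reaches the node's memory bandwidth."""
    return math.ceil(node_bw_gbs / per_core_demand_gbs)


def bandwidth_bound_speedup(cores, per_core_demand_gbs, node_bw_gbs):
    """Speedup over one core for a purely bandwidth-bound code (toy model)."""
    # Delivered bandwidth grows with core count until the node limit is hit.
    delivered = min(cores * per_core_demand_gbs, node_bw_gbs)
    return delivered / per_core_demand_gbs


# Hypothetical figures chosen for illustration only.
PER_CORE_DEMAND_GBS = 10.0
NODE_BW_GBS = 300.0

for cores in (16, 32, 64):
    speedup = bandwidth_bound_speedup(cores, PER_CORE_DEMAND_GBS, NODE_BW_GBS)
    print(f"{cores} cores: speedup {speedup:.1f}x")

# Ratio of populated memory channels, assuming equal per-channel speed.
channel_ratio = 12 / 8
print(f"12 vs 8 channels: {channel_ratio:.2f}x peak bandwidth")
```

Under these assumed numbers the speedup stops growing once the saturation point is passed, so cores beyond it add cost without adding throughput for this kind of workload.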


    We see this as well in DOE especially when GPUs are doing a
    significant amount of the work.


    Yeah, I noticed that Frontier and Aurora will actually be
    single-socket systems w/ "only" 64 cores.

Yes, Frontier is a *single* *CPU* socket and *four GPUs* (actually eight GPUs from the user's perspective). It works out to eight cores per Graphics Compute Die (GCD). The FLOPS ratio is roughly 1:100 between the CPU and GPUs.

Note that Aurora nodes have two CPUs and six GPUs. I am not sure if the user sees six or more GPUs. The Aurora node is similar to our Summit node but with more connectivity between the GPUs.

Thanks for clarifying! I thought it was a single-CPU system like Frontier. Not only is the FLOPS ratio much higher on GPUs, so is the FLOPS/W ratio. Even though CPUs have gotten much more efficient lately, their efficiency is practically stagnant compared to GPU-based clusters, based on my analysis of the Top500 and Green500 trends.


Prentice

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
