Scott (and Michael and Carlos),

Thanks for the excellent feedback; that's exactly the kind of enlightening discussion I was looking for. Interesting that the HBM on Fugaku exceeds the needs of the processor.

Prentice
On 6/16/21 2:23 PM, Scott Atchley wrote:
On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf <beowulf@beowulf.org> wrote:
Did anyone else attend this webinar panel discussion with AMD hosted by HPCwire yesterday? It was titled "AMD HPC Solutions: Enabling Your Success in HPC":

https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/
I attended it, and noticed there was no mention of AMD supporting AVX-512, so during the question-and-answer portion of the program I asked when AMD processors would support AVX-512. The answer given, and I'm not making this up, was that AMD listens to its users and gives them what they want, and right now it's not hearing any demand for AVX-512.
Personally, I call BS on that one. I can't imagine anyone in the HPC community saying "we'd like processors that offer only half the floating-point performance of Intel processors." Sure, AMD can offer more cores, but with only AVX2 you'd need twice as many cores as Intel processors, all other things being equal (the peak-FLOPS sketch below spells out that arithmetic).
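To make that concrete, here's a back-of-envelope peak-FLOPS sketch in Python. The core counts, clock speed, and FMA-unit count are illustrative assumptions, not specs for any particular SKU:

    # Peak double-precision FLOPS = cores * clock * DP lanes per vector * 2 (FMA) * FMA units.
    # All inputs below are assumed, illustrative values, not measured specs of real parts.
    def peak_dp_gflops(cores, ghz, simd_bits, fma_units=2):
        lanes = simd_bits // 64              # 64-bit doubles per vector register
        return cores * ghz * lanes * 2 * fma_units

    print(peak_dp_gflops(64, 2.0, 256))      # hypothetical 64-core AVX2 chip  -> 2048
    print(peak_dp_gflops(32, 2.0, 512))      # hypothetical 32-core AVX-512 chip -> 2048

Half the cores with double-width vectors lands on the same theoretical peak, which is exactly the "twice as many cores" trade-off.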
Last fall I evaluated potential new cluster nodes for a large cluster purchase using the HPL benchmark. I compared a server with dual AMD EPYC 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268 system and only 3,446 GFLOPS for the AMD 7H12 system. That's a LINPACK score only 64% of the Xeon 8268 system's, despite the AMD system having 33% more cores.
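For what it's worth, the ratios fall straight out of those measured numbers:

    # Ratios derived from the HPL results quoted above.
    xeon_gflops, epyc_gflops = 5389.0, 3446.0
    xeon_cores, epyc_cores = 96, 128
    print(epyc_gflops / xeon_gflops)   # ~0.64 -> "only 64%"
    print(epyc_cores / xeon_cores)     # ~1.33 -> "33% more cores"
    print(xeon_gflops / xeon_cores)    # ~56 GFLOPS per Xeon core
    print(epyc_gflops / epyc_cores)    # ~27 GFLOPS per EPYC core

Per core, that's roughly a 2x gap, consistent with AVX-512 versus AVX2 vector width.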
From what I've heard, the AMD processors run much hotter than the Intel processors, too, so I imagine a FLOPS/Watt comparison would be even less favorable to AMD.
An argument can be made that calculations that lend themselves to vectorization should be done on GPUs instead of the main processors, but the last time I checked, GPU jobs are still memory-limited, and moving data in and out of GPU memory still takes time, so I can see situations where, for large amounts of data, CPUs would be preferred over GPUs.
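As a rough illustration of the data-movement point, here's a toy model; the PCIe bandwidth, GPU throughput, and arithmetic intensity are all assumed values, not measurements of any real system:

    # Toy model: when does host<->GPU transfer dominate? All numbers are assumptions.
    pcie_gbs = 25.0         # assumed effective PCIe 4.0 x16 bandwidth, GB/s
    gpu_tflops = 7.0        # assumed sustained double-precision TFLOPS
    data_gb = 40.0          # working set too big for GPU memory, staged over PCIe
    flops_per_byte = 1.0    # assumed arithmetic intensity of the kernel

    transfer_s = data_gb / pcie_gbs
    compute_s = (data_gb * 1e9 * flops_per_byte) / (gpu_tflops * 1e12)
    print(f"transfer: {transfer_s:.2f} s, compute: {compute_s:.3f} s")
    # transfer: 1.60 s, compute: 0.006 s -- at low intensity the bus dominates

With numbers like these, the kernel would have to be re-run on the same data hundreds of times before the GPU's compute advantage pays for the transfer.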
Your thoughts?
--
Prentice
AMD has studied this quite a bit in DOE's FastForward-2 and PathForward. I think Carlos' comment is on track: having a unit that cannot be fed data quickly enough is pointless. It is application dependent. If your working set fits in cache, the vector units work well. If not, you have to move data, which stalls the compute pipelines. NERSC saw only a 10% increase in performance when moving from low-core-count Xeon CPUs with AVX2 to Knights Landing with many cores and AVX-512, when it should have seen an order-of-magnitude increase. Although Knights Landing had MCDRAM (Micron's not-quite HBM), other constraints limited performance (e.g., too few memory references in flight, coherence traffic).
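That cache-versus-bandwidth behavior is the classic roofline picture; a minimal sketch, with the peak and bandwidth figures assumed purely for illustration:

    # Minimal roofline model: attainable = min(peak, bandwidth * arithmetic intensity).
    # Peak and bandwidth figures below are assumptions for illustration only.
    def attainable_gflops(peak_gflops, bw_gbs, flops_per_byte):
        return min(peak_gflops, bw_gbs * flops_per_byte)

    # Hypothetical AVX-512 node: 3000 GFLOPS peak, 100 GB/s DRAM bandwidth.
    print(attainable_gflops(3000, 100, 0.25))  # 25.0  -> streaming kernel: vectors starve
    print(attainable_gflops(3000, 100, 40.0))  # 3000  -> cache-resident: vectors pay off

A wider vector unit only raises the flat part of the roof; a streaming kernel never gets near it.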
Fujitsu's ARM64 chip in Fugaku (the A64FX), with 512b SVE, does much better than Xeon with AVX-512 (or Knights Landing) because of the attached High Bandwidth Memory (HBM) and, I assume, a larger number of memory references in flight. The downside is the lack of memory capacity (only 32 GB per node). This shows that it is possible to get more performance from a CPU with a 512b vector engine. That said, it is not clear that even this CPU design can extract the most from the memory bandwidth. If you look at the increase in memory bandwidth from Summit to Fugaku, one would expect performance on real apps to increase by that amount as well. From the presentations I have seen, that is not always the case. For some apps, the GPU architecture, with its coherence on demand rather than with every operation, can extract more performance.
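To put a rough number on that expectation: for a purely bandwidth-bound app, the speedup should track the bandwidth ratio. The per-node figures here are placeholders for illustration, not official Summit or Fugaku specs:

    # If an app is purely bandwidth-bound, expected speedup ~= bandwidth ratio.
    # Both per-node figures below are assumed placeholders, not official specs.
    old_node_bw_gbs = 340.0     # placeholder for a Summit node's DRAM bandwidth
    new_node_bw_gbs = 1024.0    # placeholder for a Fugaku A64FX node's HBM2 bandwidth
    print(new_node_bw_gbs / old_node_bw_gbs)  # ~3x expected if purely bandwidth-bound

When a real app falls short of that ratio, something other than bandwidth (latency, coherence traffic, serial sections) is in the way.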
AMD will add 512b vectors if/when it makes sense on real apps.