Re: [Beowulf] [EXTERNAL] Re: Frontier Announcement

Prentice Bisbal via Beowulf Wed, 08 May 2019 08:51:27 -0700


On 5/7/19 6:14 PM, Lux, Jim (337K) wrote:


On 5/7/19, 2:00 PM, "Beowulf on behalf of Prentice Bisbal via Beowulf" 
<beowulf-boun...@beowulf.org on behalf of beowulf@beowulf.org> wrote:

     >   I think it is interesting that they are using AMD for
     > both the CPUs and GPUs

     I agree. That means a LOT of codes will have to be ported from CUDA to
     whatever AMD uses. I know AMD announced their HIP interface to convert
     CUDA code into something that will run on AMD processors, but I don't
     know how well that works in theory. Frankly, I haven't heard anything
     about it since it was announced at SC a few years ago.

     I would not be surprised if AMD pursued this bid quite aggressively,
     possibly at a significant loss, for the opportunity to prove their GPUs
     can compete with NVIDIA and demonstrate that codes can be successfully
     converted from CUDA to something AMD GPUs can use to demonstrate GPU
     users don't need to be locked in to a single vendor. If so, this could
     be a costly gamble for the DOE and AMD, but if it pays off, I imagine it
     could change AMD's fortunes in HPC.

     "Win on Sunday, sell on Monday" doesn't apply just to cars.

     Prentice

--
I think they're deliberately looking for architectural diversity, rather than "ease 
of porting from existing machine".

"CORAL-2 has a mandate to field architecturally diverse machines in a way that 
manages risk during a period of rapid technological evolution. “Regardless of which 
system or systems are being discussed, the systems residing at or planned to reside at 
ORNL and ANL must be diverse from one another,” notes the CORAL-2 RFP cover letter 
[PDF]."

https://asc.llnl.gov/coral-2-benchmarks/

I understand the requirement for architectural diversity. The 3 DOE Leadership Computing Facilities (LCFs) have always practiced hardware diversity. ANL typically used IBM hardware in the form of Blue Genes (Intrepid, Mira), and ORNL typically used Cray. Those two sites used bleeding-edge architectures, and NERSC, the 3rd DOE LCF, would usually go with less bleeding-edge systems.

However, this particular choice brings the risk of users not being able to, or not wanting to, port their code to a unique architecture. Not only is it different from past DOE Leadership systems, it uses an architecture that currently has about 0% market share. Based on current market trends, at least, the work of porting code to this architecture to run on a single system may not be enough incentive for some users, despite the performance advantage, since the cost of that effort can't be spread over a larger number of other systems they can use.

LANL's Roadrunner is a good analog to consider. It was the first petascale system, but it had a rather unique architecture. The DOE decommissioned the system when it was about 5 years old, even though it was still ranked quite highly on the Top500. Its replacement was Cielo, which wasn't much newer or faster than Roadrunner. From conversations I've had with people familiar with Roadrunner, I heard it was difficult to program and too expensive to continue supporting. I don't know how accurate those statements are, because I don't remember the DOE saying much about why they EOLed Roadrunner, but those explanations seemed reasonable.

And yes, I know DOE LCF systems are a bit unique in the market they serve: their users are bleeding-edge users who probably are willing to port their codes to new or unique architectures for the benefit of more compute capability. But I think it's safe to say Roadrunner's user base had the same or very similar characteristics.


Prentice


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
