Re: [Beowulf] Working for DUG, new thead

Joe Landman Tue, 19 Jun 2018 12:11:08 -0700


On 6/19/18 2:47 PM, Prentice Bisbal wrote:

On 06/13/2018 10:32 PM, Joe Landman wrote:
I'm curious about your next gen plans, given Phi's roadmap.


On 6/13/18 9:17 PM, Stu Midgley wrote:
low level HPC means... lots of things. BUT we are a huge Xeon Phi shop and need low-level programmers, i.e. avx512, careful cache/memory management (NOT openmp/compiler vectorisation etc).
I played around with avx512 in my rzf code: https://github.com/joelandman/rzf/blob/master/avx2/rzf_avx512.c . Never really spent a great deal of time on it, other than noting that using avx512 seemed to downclock the core a bit on Skylake.
If you organize your code correctly, and call the compiler with the right optimization flags, shouldn't the compiler automatically handle a good portion of this 'low-level' stuff?

I wish it would do it well, but it turns out it doesn't do a good job. You have to pay very careful attention to almost all aspects of making it simple for the compiler, and then constraining the directions it takes with code gen.

I explored this with my RZF stuff. It turns out that with -O3, gcc (5.x and 6.x) would convert a library call for the power function into an FP instruction. But it would use 1/8 - 1/4 of the XMM/YMM register width, not automatically unroll loops, or leverage the vector nature of the problem.
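To make the "hand modify the code" idea concrete, here is a minimal sketch in plain C. It is not the actual rzf code: the function names, the `ipow` helper (standing in for the library power call), and the choice of four accumulators are assumptions made purely for illustration. The first function is the straightforward loop; the second restructures the sum into four independent partial sums so the loop body maps naturally onto vector lanes.

```c
/* Hypothetical helper standing in for pow(): x raised to a small
 * non-negative integer power by repeated multiplication. */
static double ipow(double x, int n)
{
    double r = 1.0;
    for (int i = 0; i < n; i++)
        r *= x;
    return r;
}

/* Straightforward version: a single accumulator, one term per iteration.
 * Computes sum_{k=1..terms} 1/k^n, a truncated Riemann zeta sum. */
double zeta_scalar(int n, long terms)
{
    double sum = 0.0;
    for (long k = terms; k >= 1; k--)
        sum += 1.0 / ipow((double)k, n);
    return sum;
}

/* Hand-restructured version: four independent partial sums ("lanes").
 * No lane depends on another, which removes the serial dependency on a
 * single accumulator and exposes the data parallelism explicitly. */
double zeta_unrolled(int n, long terms)
{
    double s[4] = {0.0, 0.0, 0.0, 0.0};
    long k = 1;

    /* Main loop: four consecutive terms per iteration. */
    for (; k + 3 <= terms; k += 4) {
        for (int lane = 0; lane < 4; lane++)
            s[lane] += 1.0 / ipow((double)(k + lane), n);
    }

    /* Combine the lanes, then handle the leftover 0-3 terms. */
    double sum = (s[0] + s[1]) + (s[2] + s[3]);
    for (; k <= terms; k++)
        sum += 1.0 / ipow((double)k, n);
    return sum;
}
```

Both functions should agree to within rounding (the summation order differs). Whether the second form actually runs faster depends on the compiler, flags, and CPU, so it is worth measuring rather than assuming.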

Basically, not much has changed in 20+ years ... you annotate your code with pragmas and similar, or use instruction primitives and give up on the optimizer/code generator.
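As one example of the annotation route, here is a hedged sketch using the OpenMP `simd` pragma and C99 `restrict`. The `dot` function is a made-up illustration, not something from rzf. The pragma only takes effect when the compiler is told to honor it (for GCC, assuming a reasonably recent version, via `-fopenmp-simd` or `-fopenmp`); otherwise it is ignored and the loop is compiled as ordinary scalar code.

```c
#include <stddef.h>

/* Dot product annotated for vectorization.
 * - restrict promises the compiler that a and b do not alias.
 * - The simd pragma asks for a vectorized loop and declares sum as a
 *   reduction, which permits reordering of the floating-point adds. */
double dot(const double *restrict a, const double *restrict b, size_t n)
{
    double sum = 0.0;
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The annotations are hints plus permissions: they tell the compiler what it may assume, but inspecting the generated assembly is still the only way to confirm what it actually did.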

When it comes down to it, compilers aren't really as smart as many of us would like. Converting idiomatic code into efficient assembly isn't what they are designed for. Rather, they are designed to produce correct assembly. Correct doesn't mean efficient in many cases, and some of the less obvious optimizations that we might think to be beneficial are not taken. We can hand modify the code for this, and see if these optimizations are beneficial, but the compilers often are not looking at a holistic problem.

I understand that hand-coding this stuff usually still gives you the best performance (see GotoBLAS/OpenBLAS, for example), but does your average HPC programmer trying to get decent performance need to hand-code that stuff, too?

Generally, yes. Optimizing serial code for GPUs doesn't work well. Rewriting for GPUs (e.g. taking into account the GPU data/compute flow architecture) does work well.


--

Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
