On 06/19/2018 03:10 PM, Joe Landman wrote:
> On 6/19/18 2:47 PM, Prentice Bisbal wrote:
>> On 06/13/2018 10:32 PM, Joe Landman wrote:
>>> I'm curious about your next gen plans, given Phi's roadmap.
>>> On 6/13/18 9:17 PM, Stu Midgley wrote:
>>>> Low-level HPC means... lots of things. BUT we are a huge Xeon Phi
>>>> shop and need low-level programmers, i.e. avx512, careful
>>>> cache/memory management (NOT openmp/compiler vectorisation etc).
>>> I played around with avx512 in my rzf code.
>>> https://github.com/joelandman/rzf/blob/master/avx2/rzf_avx512.c .
>>> Never really spent a great deal of time on it, other than noting
>>> that using avx512 seemed to downclock the core a bit on Skylake.
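>>>
>>> For flavor, the kind of thing I mean by playing with avx512 -- a
>>> minimal, illustrative sketch (made-up helper, not the actual rzf
>>> kernel; assumes n is a multiple of 8; build with gcc -O3 -mavx512f):
>>>
>>>   #include <immintrin.h>
>>>   #include <stddef.h>
>>>
>>>   /* Fused multiply-add across 8-wide double lanes (AVX-512F). */
>>>   void fma8(const double *a, const double *b, double *c, size_t n)
>>>   {
>>>       for (size_t i = 0; i < n; i += 8) {
>>>           __m512d va = _mm512_loadu_pd(&a[i]);
>>>           __m512d vb = _mm512_loadu_pd(&b[i]);
>>>           __m512d vc = _mm512_loadu_pd(&c[i]);
>>>           vc = _mm512_fmadd_pd(va, vb, vc);   /* c += a*b */
>>>           _mm512_storeu_pd(&c[i], vc);
>>>       }
>>>   }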
>> If you organize your code correctly, and call the compiler with the
>> right optimization flags, shouldn't the compiler automatically handle
>> a good portion of this 'low-level' stuff?
> I wish it would do it well, but it turns out it doesn't do a good
> job. You have to pay very careful attention to making almost every
> aspect of the code simple for the compiler, and then constrain the
> directions it takes with code generation.
> I explored this with my RZF stuff. It turns out that with -O3, gcc
> (5.x and 6.x) would convert a library call to the power function into
> FP instructions, but it would use only 1/8 to 1/4 of the XMM/YMM
> register width, wouldn't automatically unroll loops, and wouldn't
> leverage the vector nature of the problem.
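>
> Boiled down, the loop shape in question was roughly this (an
> illustrative sketch, not the actual rzf source; names are made up):
>
>   #include <math.h>
>
>   /* With -O3, gcc turns the pow(x, 2.0) library call into inline  */
>   /* FP multiplies -- but, per the 5.x/6.x behavior above, it used */
>   /* a fraction of the XMM/YMM width and didn't unroll the loop.   */
>   void powloop(const double *x, double *y, int n)
>   {
>       for (int i = 0; i < n; i++)
>           y[i] = pow(x[i], 2.0);
>   }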
> Basically, not much has changed in 20+ years ... you annotate your
> code with pragmas and similar, or use instruction primitives and give
> up on the optimizer/code generator.
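>
> The pragma route looks like this (again just a sketch, using OpenMP's
> simd pragma, gcc -O3 -fopenmp-simd -- whether the pow() call itself
> vectorizes still depends on a SIMD math library such as glibc's
> libmvec):
>
>   #include <math.h>
>
>   /* restrict promises no aliasing, which the compiler otherwise */
>   /* won't assume; the pragma asks for vector code explicitly.   */
>   void powloop_simd(const double * restrict x, double * restrict y,
>                     int n)
>   {
>       #pragma omp simd
>       for (int i = 0; i < n; i++)
>           y[i] = pow(x[i], 2.0);
>   }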
> When it comes down to it, compilers aren't really as smart as many of
> us would like. Converting idiomatic code into efficient assembly
> isn't what they are designed for; they are designed to produce
> correct assembly. Correct doesn't mean efficient in many cases, and
> some of the less obvious optimizations that we might expect to be
> beneficial are not taken. We can hand-modify the code and see whether
> those optimizations help, but the compilers often are not looking at
> the problem holistically.
>> I understand that hand-coding this stuff usually still gives you the
>> best performance (see GotoBLAS/OpenBLAS, for example), but does your
>> average HPC programmer trying to get decent performance need to
>> hand-code that stuff, too?
> Generally, yes. Optimizing serial code for GPUs doesn't work well.
> Rewriting for GPUs (e.g. taking into account the GPU data/compute flow
> architecture) does work well.
Thanks for the reply. This sounds like the perfect opportunity for me to
rant about Intel's marketing of Xeon Phi vs. GPUs. When GPUs took off
and Intel was formulating their answer, they kept saying you wouldn't
need to rewrite your code the way you do for GPUs. You could just
recompile and everything would work on the new MIC processors.
Then when Intel's MIC processors finally did come out, guess what? You
*did* have to rewrite your code to get any meaningful increase in
performance. For example, you had to make sure your loops were
data-parallel and use OpenMP, TBB, Cilk Plus, or whatever to really
take advantage of the MIC. This meant you had to rewrite your code,
but Intel did everything they could to avoid admitting it, hiding
behind the euphemism 'code modernization' instead.
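
Just to be concrete about what 'code modernization' meant in practice,
here's a sketch of my own (my example, not Intel's): a straight
recompile of a serial loop got you one slow core, while actually
feeding a Phi meant restructuring into something explicitly threaded
and vectorized, e.g. with OpenMP 4.0 (gcc/icc: -fopenmp):

  #include <stddef.h>

  /* "Modernized" saxpy: explicitly data-parallel, spread across  */
  /* threads and vector lanes -- the restructuring Intel's        */
  /* marketing said you wouldn't need.                            */
  void saxpy(float a, const float * restrict x,
             float * restrict y, size_t n)
  {
      #pragma omp parallel for simd
      for (size_t i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }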
I often wonder if that misleading marketing is one of the reasons the
Xeon Phi has already been canned. I know a lot of people who were
excited about the Xeon Phi, but I don't know anyone who actually
bought one once they came out.
Prentice
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf