I have followed this thread thinking "square peg, round hole." You have got it again, Joe: compilers are the problem.
On Sun, Jun 20, 2021, 10:21 AM Joe Landman <joe.land...@gmail.com> wrote:

> (Note: not disagreeing at all with Gerald, actually agreeing strongly ... also, correct address this time! Thanks Gerald!)
>
> On 6/19/21 11:49 AM, Gerald Henriksen wrote:
>
> > On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
> >
> > > The answer given, and I'm not making this up, is that AMD listens to their users and gives the users what they want, and right now they're not hearing any demand for AVX512.
>
> More accurately, there is call for it, from a very small segment of the market: ones who buy small quantities of processors (under 100k volume per purchase). That is, not a significant enough portion of the market to make a huge difference to the supplier (Intel).
>
> And more to the point, AI and HPC joining forces has put the spotlight on small matrix multiplies, often with lower precision. I'm not sure (haven't read much on it recently) if AVX512 will be enabling/has enabled support for bfloat16/FP16 or similar. These tend to go to GPUs and other accelerators.
>
> > > Personally, I call BS on that one. I can't imagine anyone in the HPC community saying "we'd like processors that offer only 1/2 the floating point performance of Intel processors".
> >
> > I suspect that is marketing speak, which roughly translates to: not that no one has asked for it, but rather that requests haven't reached a threshold where they are viewed as significant enough.
>
> This, precisely. AMD may be losing the AVX512 users to Intel, but that's a minuscule fraction of the overall users of its products. The demand for this is quite constrained. Moreover, there are often significant performance consequences to using AVX512 (downclocking, pipeline stalls, etc.), whereby the cost of enabling and using it far outweighs the benefit of providing it for the vast, overwhelming portion of the market.
>
> And, as noted above on the accelerator side, this use case (large vectors) is better handled by the accelerators. There is a cost (engineering, code design, etc.) to using accelerators as well, but it won't directly impact the CPUs.
>
> > > Sure, AMD can offer more cores, but with only AVX2, you'd need twice as many cores as Intel processors, all other things being equal.
>
> ... or you run the GPU versions of the code, which are likely getting more active developer attention. AVX512 applies to only a minuscule number of codes/problems. It's really not a panacea.
>
> More to the point, have you seen how "well" compilers use AVX2/SSE registers and do code gen? It's not pretty in general. Would you want the compilers to purposefully spit out AVX512 code the way they do AVX2/SSE code now? I've found one has to work very hard with intrinsics to get good performance out of AVX2, never mind AVX512.
>
> Put another way, we've been hearing about "smart" compilers for a while, and in all honesty, most can barely implement a standard correctly, never mind generate reasonably (near-)optimal code for the target system. This has been a problem my entire professional life, and while I wish they were better, at the end of the day this is where human intelligence fits into the HPC/AI narrative.
>
> > But of course all other things aren't equal. AVX512 is a mess.
>
> Understated, and yes.
>
> > Look at the Wikipedia page(*) and note that AVX512 means different things depending on the processor implementing it.
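To make Joe's point above about hand-written intrinsics concrete, here is a minimal sketch (my own illustration, not from the thread) of a float dot product written with AVX2/FMA intrinsics. It assumes GCC or Clang with -mavx2 -mfma, and that n is a multiple of 8; a seriously tuned version would also want several independent accumulators to hide FMA latency, which is exactly the "work very hard" part.

    /* Sketch: AVX2/FMA dot product. Assumes n % 8 == 0.
       Build: gcc -O2 -mavx2 -mfma dot.c */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdio.h>

    float dot_avx2(const float *a, const float *b, size_t n)
    {
        __m256 acc = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb, 8 lanes */
        }
        /* horizontal sum of the 8 lanes */
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        return _mm_cvtss_f32(lo);
    }

    int main(void)
    {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        printf("%f\n", dot_avx2(a, b, 8));        /* expect 36.0 */
        return 0;
    }

Even this toy case needs explicit load, FMA, and reduction choreography; compilers asked to auto-generate the AVX512 equivalent have far more freedom to get it wrong.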
> I made comments previously about which ISA ARM folks were going to write to. That is, different processors, likely implementing different instructions, differently ... you won't really have one equally good compiler for all these features. You'll have a compiler that implements common denominators reasonably well, which mitigates the benefits of the ISA/architecture.
>
> Intel has the same problem with AVX512. I know, I know ... feature flags on the CPU (see the last line of lscpu output). And how often have certain (ahem) compilers ignored the flags, and used a different mechanism to determine CPU feature support, specifically targeting their competitor's offerings to force (literally) low-performance paths for those CPUs?
>
> > So what does the poor software developer target?
>
> Lowest common denominator. Make the code work correctly first. Then make it fast. If fast is platform specific, ask how often that platform will be used.
>
> > Or that it can for heat reasons cause CPU frequency reductions, meaning real world performance may not match theoretical - thus easier to just go with GPUs.
> >
> > The result is that most of the world is quite happily (at least for now) ignoring AVX512 and going with GPUs as necessary - particularly given the convenient libraries that Nvidia offers.
>
> Yeah ... like it or not, that battle is over (for now).
>
> [...]
>
> > > An argument can be made that calculations that lend themselves to vectorization should be done on GPUs instead of the main processors, but the last time I checked, GPU jobs are still memory limited, and moving data in and out of GPU memory can still take time, so I can see situations where, for large amounts of data, using CPUs would be preferred over GPUs.
> > >
> > > AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, which may or may not mean a difference.
>
> It does. IO and memory bandwidth/latency are very important, and oft overlooked, aspects of performance. If you have a choice of doubling IO and memory bandwidth at lower latency (usable by everyone) vs adding an AVX512 unit or two (usable by a small fraction of a percent of all users), which would net you, as an architect, the best "bang for the buck"?
>
> > > But despite all of the above and the other replies, it is AMD who has been winning the HPC contracts of late, not Intel.
>
> There's a reason for that. I will admit I have a devil of a time trying to convince people that a higher clock frequency matters only to a small fraction of operations, especially ones waiting on (slow) RAM and (slower) IO. Make the RAM and IO faster (lower latency, higher bandwidth), and the system will be far more performant.
>
> --
>
> Joe Landman
> e: joe.land...@gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
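Joe's "feature flags on the CPU" and "make it work correctly first" points can be combined in code. Here is a minimal sketch (my illustration, assuming GCC or Clang, whose __builtin_cpu_supports() reads the same CPUID bits that lscpu reports) of dispatching on the flags themselves rather than on a vendor string; the kernel_* names are placeholders for tuned variants:

    /* Sketch: pick a code path from CPUID feature flags, not vendor ID. */
    #include <stdio.h>

    static void kernel_scalar(void) { puts("scalar path"); }   /* lowest common denominator */
    static void kernel_avx2(void)   { puts("AVX2 path"); }
    static void kernel_avx512(void) { puts("AVX-512 path"); }

    int main(void)
    {
        __builtin_cpu_init();                     /* populate the flag cache */
        void (*kernel)(void) = kernel_scalar;     /* correct first ... */
        if (__builtin_cpu_supports("avx2"))       /* ... then fast */
            kernel = kernel_avx2;
        if (__builtin_cpu_supports("avx512f") &&  /* base AVX-512 alone isn't */
            __builtin_cpu_supports("avx512vl"))   /* enough; subsets vary by CPU */
            kernel = kernel_avx512;
        kernel();
        return 0;
    }

GCC's target_clones function attribute automates the same per-function dispatch, which is one answer to "what does the poor software developer target."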
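And on the data-movement point: a back-of-envelope calculation, using assumed round numbers (~16 GB/s for a PCIe 3.0 x16 link, ~32 GB/s for PCIe 4.0 x16; sustained rates are lower in practice), shows why host-to-GPU transfers can dominate a job:

    /* Sketch: host<->GPU transfer time for an assumed 16 GB working set. */
    #include <stdio.h>

    int main(void)
    {
        const double gb    = 16.0;   /* working set, GB (assumed) */
        const double pcie3 = 16.0;   /* ~GB/s, PCIe 3.0 x16 (assumed peak) */
        const double pcie4 = 32.0;   /* ~GB/s, PCIe 4.0 x16 (assumed peak) */
        printf("PCIe 3.0 x16: %.1f s each way\n", gb / pcie3);
        printf("PCIe 4.0 x16: %.1f s each way\n", gb / pcie4);
        return 0;
    }

If the kernel itself finishes in a fraction of that transfer time, the CPU path can win despite lower peak FLOPS, which is the situation described above.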
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf