Hi all,

> This is, in my humble opinion, also the big problem CPUs are facing. They
> are built to tackle all possible scenarios, from simple integer to floating
> point, from in-memory to disk I/O. In some respects it would have been
> better to stick with a separate math unit, which could then be selected
> according to the workload you want to run on that server. I guess this is
> where the GPUs are trying to fit in, or maybe ARM.
I recall a few years ago the rumors that the Argonne "A18" system was going
to use the 'Configurable Spatial Accelerators' that Intel was developing,
with the idea being you *could* reconfigure based on the needs of the code.
In principle, it sounds like the Holy Grail, but in practice it seems quite
difficult, and I don't believe I've heard much more about the CSA approach
since.

WikiChip on the CSA:
https://en.wikichip.org/wiki/intel/configurable_spatial_accelerator

NextPlatform article:
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/

I have to imagine that research hasn't gone fully quiet, especially with
Intel's moves towards oneAPI and their FPGA experience, but I haven't seen
anything about it in a while.

Of course...

> I also agree with the compiler "problem". If you start to push some
> compilers too hard, the code runs very fast but the results are simply
> wrong. Again, in an ideal world we would have a compiler suited to the
> given hardware, which also depends on the job you want to run.
> ...

It exacerbates the compiler issues, *I think*. I hesitate to say it does so
definitively, since the patent write-up talks about how the CSA architecture
uses a representation very similar to what the (now old) Intel compilers
created as an IR (intermediate representation).

In my opinion, having a compiler that can 'do everything' is like having an
AI that can do everything - we're good at very, *very* specific use-cases,
but not generality. So configurable systems are a big challenge. (I'm *way*
out of my depth on compilers, though - maybe they're improving massively?)

> Maybe the whole climate problem will finally push HPC into more bespoke
> systems where the components are fit for the job in question, say weather
> modeling for example, simply because that would be more energy efficient
> and faster.

I can't speak to whether climate research will influence hardware, but back
to the *original* theme of this thread: I actually have some data -very
*limited* data, mind you!- on how NCAR's climate model, CESM, performs
across AVX2, AVX512, and AVX512+FMA when run as an 'F2000climo' case (one of
many, many cases, and very atmosphere-focused) at 2-degree atmosphere
resolution (*very* coarse) on a 36-core Xeon Skylake. By default, FMA is
turned off in these cases due to numerical sensitivity.

So, that's a *very* specific case, but on the off chance people are curious,
here's what it looks like - note that this is *noisy* data, because the
model also does a lot of I/O, which is why I tend to look at the median
times:

SKX (AWS c5n.18xlarge) Performance Comparison
CESM Case: F2000climo @ f19_g17 resolution
(36 cores per component / 10 model day run, skipping first and last day)

    Flags     AVX2 (no FMA)   AVX512 (no FMA)   AVX512 + FMA
    Min           60.18           60.24             59.16
    Max           66.26           60.47             59.40
    Median        60.28           60.38             59.32

The take-away? We're not really benefiting *at all* (at this resolution, for
this compset, etc.) from AVX512 here. Maybe at higher resolution? Maybe with
more vertical levels, or chemistry, or something like that? *Maybe*, but the
differences seem indistinguishable from noise here, and possibly negative!
Now, give us more *memory bandwidth*, and that's fantastic. Could this code
be rewritten to take better advantage of larger vectors? Sure, and some
*really* capable people do work on that sort of stuff, and it helps, but as
an *evolution* in performance, not a revolution in it.
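Since I mentioned FMA being switched off for numerical sensitivity, here's a
minimal sketch of why fusing changes answers (my own toy C example, nothing
to do with CESM's actual code). A fused multiply-add rounds once, while a
separate multiply and add round twice, so the two paths can disagree in the
last bits:

    /* fma_demo.c - toy example (not CESM code): fused vs. separate
     * rounding. Build with:  cc -O2 -ffp-contract=off fma_demo.c -lm  */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 0x1p-27;            /* a*a is not exactly representable */
        double separate = a * a - 1.0;       /* multiply rounds, then subtract   */
        double fused    = fma(a, a, -1.0);   /* one rounding at the very end     */

        printf("separate: %.17e\n", separate);
        printf("fused   : %.17e\n", fused);
        printf("delta   : %.3e\n", fused - separate);  /* ~5.6e-17, not zero */
        return 0;
    }

The fun part: rebuild with -ffp-contract=fast (or let an aggressive
optimization level contract it for you) and the "separate" line quietly
becomes the fused result. Multiply that kind of last-bit drift across
millions of grid cells and time steps, and you can see why a climate model
might pin FMA off by default for reproducibility.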
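And on the memory bandwidth point, a quick way to convince yourself that a
streaming kernel doesn't care much about vector width (again, a hand-rolled
toy in the spirit of STREAM's triad, not the official benchmark):

    /* triad_demo.c - STREAM-style triad: 2 flops per 24 bytes moved, so
     * the core mostly waits on memory, whatever the vector width.
     * Try:  cc -O3 -march=skylake-avx512 -mprefer-vector-width=256 ...
     * vs.   cc -O3 -march=skylake-avx512 -mprefer-vector-width=512 ...  */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1L << 25)   /* 32M doubles (256 MB) per array: far bigger than cache */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];                  /* the triad */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        /* 24 bytes per iteration: read b[i] and c[i], write a[i]
         * (STREAM convention; write-allocate traffic not counted).
         * Printing a[0] keeps the loop from being optimized away.   */
        printf("%.1f GB/s (a[0]=%g)\n", 24.0 * N / secs / 1e9, a[0]);
        free(a); free(b); free(c);
        return 0;
    }

On a bandwidth-starved node I'd expect the GB/s figure to barely move
between the 256- and 512-bit builds, which is exactly the shape of the CESM
result above.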
(Also, I'm always horrified by presenting one-off tests, like the CESM
numbers above, as examples of anything, but it's the only data I have
on-hand! Other cases may indeed vary.)

> Before somebody comes along with: but but but it costs! Think about how
> much money is being spent simply to kill people, or on other wasteful
> projects like Brexit etc.

One can only hope. When it comes to spending on research, I recall the
quote: "If you think education is expensive, try ignorance!"

Cheers,
- Brian

On Monday, 21 June 2021 at 14:46:30 BST, Joe Landman wrote:
> On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> > I have followed this thinking "square peg, round hole."
> > You have got it again, Joe. Compilers are your problem.
>
> Erp ... did I mess up again?
>
> System architecture has been a problem ... making a processing unit
> 10-100x as fast as its support components means you have to code with
> that in mind. A simple `gfortran -O3 mycode.f` won't necessarily
> generate optimal code for the system (but I swear ... -O3 ... it says
> it on the package!).
>
> Way back at Scalable, our secret sauce was largely increasing IO
> bandwidth and lowering IO latency while coupling computing more tightly
> to this massive IO/network pipe set, combined with intelligence in the
> kernel on how to better use the resources. It was simply a better
> architecture. We used the same CPUs. We simply exploited the design
> better.
>
> The end result was that codes with off-CPU work (storage, networking,
> etc.) could push our systems far harder than competitors'. And you
> didn't have to use a different ISA to get these benefits. No
> recompilation needed, though we did show the folks who were interested
> how to get even better performance.
>
> Architecture matters, as does the implementation of that architecture.
> There are costs to every decision within an architecture. With AVX512
> comes lots of other baggage associated with downclocking, etc. You have
> to do a cost-benefit analysis on whether or not it is worth paying for
> that baggage, given the benefits you get from doing so. Some folks have
> made that decision towards AVX512, and have been enjoying the benefits
> of doing so (i.e., they are willing to pay the costs). For the general
> audience, these costs represent a (significant) hurdle one must
> overcome.
>
> Here's where awesome compiler support would help. FWIW, gcc isn't that
> great a compiler. It's not performance-minded for HPC. It's a
> reasonable, general-purpose, standards-compliant (for some subset of
> standards) compilation system. LLVM is IMO a better compiler system,
> and its clang/flang are developing nicely, albeit still not really
> HPC-focused. Then you have the variants built on that, like the Cray,
> Nvidia, and AMD compilers. These are HPC-focused, and actually do quite
> well with some codes (though the AMD version lags the Cray and Nvidia
> compilers). You've got the Intel compiler, which would be a good
> general compiler if it weren't more of a marketing vehicle for Intel
> processors and their features (hey, you got an AMD chip? you will take
> the slowest code path even if you support the features needed for the
> high-performance code path).
>
> Maybe, someday, we'll get a great HPC compiler for C/Fortran.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf