> So, make sure your code vectorises and has no thread-blocking points.

I remember hearing this when learning MPI as an undergrad back in the '90s... It's probably always been true!
On Sun, Nov 19, 2017 at 7:22 PM, Stu Midgley <sdm...@gmail.com> wrote:

> We have found that running in cached/quadrant mode gives excellent
> performance. With our codes, the optimum is 2 threads per core. KNL broke
> the model of KNC, which did a full context switch every clock cycle (so you
> HAD to have multiple threads per core); the knock-on effect is that fewer
> threads are required to reach maximum performance. However, if your code
> scales to 128 threads... it probably scales to 240, so it probably doesn't
> matter.
>
> The programming model is much easier than GPUs'. We have codes running
> (extremely fast) on KNL that no one has managed to get running on GPUs
> (mostly due to the memory model of the GPU).
>
> So you shouldn't write them off. No matter which way you turn, you will
> most likely have x86 and lots of cores... and those cores will have AVX512
> going forward (and probably later AVX1024, or whatever they'll call it).
> So, make sure your code vectorises and has no thread-blocking points.
>
> On Mon, Nov 20, 2017 at 9:09 AM, Richard Walsh <rbwcn...@gmail.com> wrote:
>
>> Well ...
>>
>> KNL is (only?) superior for highly vectorizable codes that at scale can
>> run out of MCDRAM (scalar performance is slow). Multiple memory and
>> interconnect modes (requiring a reboot to change) create a programming
>> complexity (e.g. managing affinity across the 8-9-9-8 tiles in quadrant
>> mode) that few outside the National Labs were able or interested in
>> managing. Using 4 hyperthreads is not often useful. When used in cache
>> mode, the direct-mapped L3 cache suffers gradual performance degradation
>> from fragmentation. Delays in its release, and in tuning the KNL BIOS for
>> performance, shrank its window of advantage over the Xeon line
>> significantly, as well as over the then-new GPUs (Pascal). The challenges
>> of performance programming added to this shrinkage (lots of dungeon
>> sessions); FLOPS per Watt is good, but not as good as a GPU's.
>> The programming-environment compatibility is good, although there are
>> instruction subsets that are not portable ... you have to build with
>>
>> -xCOMMON-AVX512 ...
>>
>> But as someone said, "it is fast" ... I would say maybe now it "was fast"
>> for a comparably short period of time. If you already have 10s of racks
>> and have them figured out, then you like the reduced operating cost and
>> may just buy some more as the price drops; but if you did not buy in
>> gen 1, then maybe you are not so disappointed at the change of plans ...
>> and maybe it is time to merge many-core and multi-core anyway.
>>
>> Richard Walsh
>> Thrashing River Computing
>
> --
> Dr Stuart Midgley
> sdm...@gmail.com

--
- - - - - - - - - - - - - - - - - - - - -
Nathan Moore
Mississippi River and 44th Parallel
- - - - - - - - - - - - - - - - - - - - -
_______________________________________________ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf