We have a code written for both the Phi and the K10s and they give about the same performance (both are highly optimised finite-difference codes).
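For anyone who hasn't played with this class of kernel, here is a heavily simplified single-precision sketch (hypothetical, 1-D, 4th-order; nothing like the production codes, which are 3-D, blocked and hand-vectorised). It is only meant to show the shape of the dependence-free inner loop that the compiler, or hand-coded SSE3 / Phi vector intrinsics, gets to chew on:

  /* Hypothetical 4th-order 1-D finite-difference step, single precision.
   * Purely illustrative: c[] holds 5 stencil coefficients, and the inner
   * loop is the sort of thing that has to vectorise well to get the
   * single-precision throughput being discussed here. */
  void fd_step(const float *restrict in, float *restrict out,
               const float *restrict c, int n)
  {
      for (int i = 4; i < n - 4; i++) {
          float acc = c[0] * in[i];
          for (int k = 1; k <= 4; k++)
              acc += c[k] * (in[i - k] + in[i + k]);
          out[i] = acc;
      }
  }

The real kernels differ in every detail, but that loop structure is why these codes vectorise and scale the way they do.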
--
Dr Stuart Midgley
sdm...@sdm900.com


On 15/02/2013, at 4:53 AM, Richard Walsh <rbwcn...@gmail.com> wrote:

>
> Hey Stuart,
>
> Thanks much for the detail.
>
> So, if I am reading you correctly, your test was on a single physical
> Phi (you will later expand to multiple Phis). This was a highly parallel
> single-precision application which showed the expected linear speedup to
> 60 cores ... then a kink as you crossed into hyper-threaded operation,
> with a slope half as steep, up to a factor of two at 120 core-equivalents
> with a 4-to-1 oversubscription of hyper-threads. This was all done with
> the Intel compilers on an unmodified pthreaded code that vectorises well.
>
> A good result ... on an application that is a perfect candidate for the
> Phi. To run elsewhere with CUDA, OpenMP, or OpenACC directives would
> require quite a bit of recoding, which you were happy to avoid. My guess
> is that if you had a CUDA implementation you would see better performance
> on a Fermi or Kepler, but that is a programming path you do not wish to
> take.
>
> This is an interesting case to hear about. The line from NVIDIA's flacks
> (technical marketing) is to focus on the difficulty of using the 'offload'
> model and the Intel extensions to OpenMP, Cilk, etc., articulate their
> hardware's performance advantages, and talk about OpenACC. These arguments
> are not unreasonable, but clearly not universally decisive.
>
> Thanks much ... and good luck getting all your other codes to scale just
> as well.
>
> rbw
>
> On Thu, Feb 14, 2013 at 10:18 AM, Dr Stuart Midgley <sdm...@gmail.com> wrote:
>
> Evening
>
> Sorry for the slow response.
>
> Most of our codes are pthreads; we have avoided MPI and OpenMP as much as
> possible. Our current cluster consists of Nehalem, Westmere, Sandy Bridge
> and Interlagos of various flavours. Our Phi cards are in Sandy Bridge
> systems (the host machine has 16 cores with 128GB RAM). We run the Intel
> compilers.
>
> Our fastest systems are the 64-core Interlagos systems (256GB RAM) running
> at 2.6GHz. For a few of our most important kernels, a single Phi had
> greater throughput than a whole node, which, if you count the flops, is
> expected. The Phis have a massive amount of single-precision floating-point
> performance (our codes are single precision).
>
> Our kernels vectorise very well (lots of hand-coded SSE3) and are expected
> to run very well on the Phi (we haven't tested these codes yet). The codes
> we have tested are trivially parallel and very FP heavy; they ported easily
> to the Phi and run very well.
>
> The codes I tested (in like 2hrs) saw linear speedup to 60 cores, then a
> "kink" in performance, and then continued performance gains right up to 240
> threads. Essentially these codes are single-cpu with a trivial wrapper
> around them to hand out work. This is exactly what hyper-threading was
> designed to help with. So at 240 threads, we were about 120 times faster
> than a single thread of this code. At 60 threads, we were 60 times faster :)
>
> Again, since the codes I tested were small data in, small data out, heavy
> compute and trivially parallel, running over multiple Phis is trivial and
> provides linear performance gains. As we start porting more of our complex
> codes, I expect to see similar gains. Our codes already run very very well
> on 64 cores…
>
> The Phis are separate cards, in separate PCIe slots. I have not delved into
> the programming APIs fully, but I suspect you can utilise one Phi card
> for your threaded codes.
> The way I've been running is with a native Phi application (basically
> using the Phi as a separate Linux cluster node)… using it in offload mode
> is very different, and you may well be able to get your kernel running
> across both with the right pragmas.
>
> To be 100% honest, we took the boots-and-all approach. If we had only
> purchased 1 Phi to test on, we would never have expended the energy to
> port all our codes. Purchasing hundreds of them gives you a lot of impetus
> to port your codes quickly :)
>
>
> --
> Dr Stuart Midgley
> sdm...@sdm900.com
>
>
> On 13/02/2013, at 12:38 AM, Richard Walsh <rbwcn...@gmail.com> wrote:
>
> >
> > Hey Stuart,
> >
> > Thanks for your answer ...
> >
> > That sounds compelling. May I ask a few more questions?
> >
> > So should I assume that this was a threaded SMP-type application
> > (OpenMP, pthreads), or is it MPI based? Is the supporting multi-core CPU
> > of Sandy Bridge vintage? Have you been able to compare the hyper-threaded,
> > multi-core scaling on the Sandy Bridge side of the system with that on
> > the Phi (fewer cores to compare, of course)? Using the Intel compilers, I
> > assume ... how well do your kernels vectorize? Curious about the observed
> > benefits of hyper-threading, which generally offers little to
> > floating-point-intensive HPC computations where functional-unit collision
> > is an issue. You said you have 2 Phis per node. Were you running a single
> > job across both? Were the Phis in separate PCIe slots or on the same card
> > (sorry, I should know this, but I have just started looking at Phi)? If
> > they are on separate cards in separate slots, can I assume that I am
> > limited to MPI parallel implementations when using both?
> >
> > Maybe that is more than a few questions ... ;-) ...
> >
> > Regards,
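PS: to make the "trivial wrapper to hand out work" concrete, here is a stripped-down sketch of the idea (hypothetical, not our actual harness; work_item() stands in for the existing single-cpu kernel). Each thread just pulls the next job index off a shared counter, so running 240 threads on a Phi, or 64 on an Interlagos node, is only a matter of changing NTHREADS:

  /* Stripped-down sketch of a "hand out work" pthreads wrapper.
   * Hypothetical, not the production harness: work_item() is a
   * placeholder for the real single-cpu kernel (small data in,
   * heavy single-precision compute, small data out). */
  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 240          /* 60 cores x 4 hyper-threads on the Phi */
  #define NJOBS    10000

  static long next_job = 0;
  static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;

  static void work_item(long job)
  {
      (void)job;                /* real kernel goes here */
  }

  static void *worker(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&job_lock);
          long job = next_job++;
          pthread_mutex_unlock(&job_lock);
          if (job >= NJOBS)
              break;
          work_item(job);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NTHREADS];
      for (int i = 0; i < NTHREADS; i++)
          if (pthread_create(&tid[i], NULL, worker, NULL) != 0) {
              perror("pthread_create");
              return 1;
          }
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(tid[i], NULL);
      return 0;
  }

The real harness has more plumbing around data handling, but the work handout really is this simple, which is why oversubscribing the hyper-threads costs nothing to try.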