We have a code written for both the Phi and the K10s, and the two versions give 
about the same performance (both are highly optimised finite difference codes).




--
Dr Stuart Midgley
sdm...@sdm900.com




On 15/02/2013, at 4:53 AM, Richard Walsh <rbwcn...@gmail.com> wrote:

> 
> Hey Stuart,
> 
> Thanks much for the detail.  
> 
> So, if I am reading you correctly, your test was on a single
> physical PHI (you will later expand to multiple PHIs).  This
> was a highly parallel single-precision application which showed
> the expected linear speed-up to 60 cores ... then a kink as you
> cross into hyper-threaded operation, with a slope about half as
> steep, reaching roughly 120 core-equivalents (a further factor of
> two) at a 4-to-1 oversubscription of hyper-threads.  This was all
> done with the Intel compilers on an unmodified pthreaded code that
> is well-vectorized.
> 
> A good result ... on an application that is a perfect candidate
> for PHI.  To run elsewhere with CUDA, OpenMP, or OpenACC
> directives would require quite a bit of recoding, which you were
> happy to avoid.  My guess is that if you had a CUDA implementation
> you would see better performance on a FERMI or KEPLER,
> but that is a programming path you do not wish to take.
> 
> This is an interesting case to hear about.  The flak (technical
> marketing) coming from NVIDIA focuses on the difficulty of using
> the 'offload' model and Intel's extensions to OpenMP, Cilk, etc.,
> articulates their own hardware's performance advantages, and talks
> about OpenACC.  These arguments are not unreasonable, but they are
> clearly not universally decisive.
> 
> Thanks much ... and good luck getting all your other codes 
> to scale just as well.
> 
> rbw
> 
> On Thu, Feb 14, 2013 at 10:18 AM, Dr Stuart Midgley <sdm...@gmail.com> wrote:
> Evening
> 
> Sorry for the slow response.
> 
> Most of our codes use pthreads; we have avoided MPI and OpenMP as much as
> possible.  Our current cluster consists of Nehalem, Westmere, Sandy Bridge
> and Interlagos systems of various flavours.  Our Phi cards are in Sandy
> Bridge systems (each host machine has 16 cores and 128 GB of RAM).  We use
> the Intel compilers.
> 
> Our fastest systems are the 64-core Interlagos systems (256 GB of RAM)
> running at 2.6 GHz.  For a few of our most important kernels, a single Phi
> had greater throughput than a whole node, which, if you count the flops, is
> expected.  The Phis have a massive amount of single-precision floating-point
> performance (our codes are single precision).
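> 
> As a rough back-of-the-envelope (peak single-precision figures from memory,
> so the exact numbers depend on the SKUs -- treat these as approximate):
> 
>   Phi:        ~60 cores x ~1.05 GHz x 16 SP lanes x 2 (FMA)  ~= 2.0 TFLOP/s
>   Interlagos: 32 FPU modules x 2.6 GHz x 16 SP flops/cycle   ~= 1.3 TFLOP/s
> 
> so on paper a single Phi has more single-precision peak than the whole
> 64-core node.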
> 
> Our kernels vectorise very well (lots of hand-coded SSE3) and are expected
> to run very well on the Phi (we haven't tested these codes yet).  The codes
> we have tested are trivially parallel and very FP-heavy - they ported easily
> to the Phi and run very well.
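> 
> To give a flavour of what "hand-coded SSE3" means here, a SAXPY-style
> fragment is below.  It is illustrative only, not one of our actual kernels;
> the function name and the alignment/length assumptions are made up for the
> example:
> 
>   /* Illustrative hand-vectorised single-precision update using SSE
>    * intrinsics (not real production code).  Assumes n is a multiple
>    * of 4 and that x, y are 16-byte aligned. */
>   #include <pmmintrin.h>   /* SSE3 and earlier */
> 
>   void saxpy_sse(float *y, const float *x, float a, long n)
>   {
>       __m128 va = _mm_set1_ps(a);           /* broadcast a to 4 lanes */
>       for (long i = 0; i < n; i += 4) {
>           __m128 vx = _mm_load_ps(&x[i]);   /* load 4 floats */
>           __m128 vy = _mm_load_ps(&y[i]);
>           vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
>           _mm_store_ps(&y[i], vy);          /* store 4 floats */
>       }
>   }
> 
> (The Phi's vector units are 512 bits wide and don't execute SSE, so on the
> Phi we would rely on the compiler's vectorisation rather than these
> host-side intrinsics.)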
> 
> The codes I tested (in about 2 hrs) saw linear speed-up to 60 cores, then a
> "kink" in performance, then continued performance gains right up to 240
> threads.  Essentially these codes are single-CPU codes with a trivial wrapper
> around them to hand out work - exactly what hyper-threading was designed to
> help with.  So at 240 threads we were about 120 times faster than a single
> thread of this code; at 60 threads we were 60 times faster :)
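> 
> The wrapper really is trivial -- something like the sketch below (the names,
> work-item count and thread count are made up for illustration; this is not
> our actual code):
> 
>   /* Hand out independent work items to a pool of pthreads.  Each worker
>    * grabs the next item index under a mutex and runs the single-CPU
>    * kernel on it until the items run out. */
>   #include <pthread.h>
> 
>   #define NUM_ITEMS   10000
>   #define NUM_THREADS 240     /* e.g. 60 Phi cores x 4 hyper-threads */
> 
>   static long next_item = 0;
>   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> 
>   static void do_kernel(long item)   /* stand-in for the FP-heavy kernel */
>   {
>       (void)item;
>   }
> 
>   static void *worker(void *arg)
>   {
>       (void)arg;
>       for (;;) {
>           pthread_mutex_lock(&lock);
>           long item = next_item++;
>           pthread_mutex_unlock(&lock);
>           if (item >= NUM_ITEMS)
>               break;
>           do_kernel(item);
>       }
>       return NULL;
>   }
> 
>   int main(void)
>   {
>       pthread_t t[NUM_THREADS];
>       for (int i = 0; i < NUM_THREADS; i++)
>           pthread_create(&t[i], NULL, worker, NULL);
>       for (int i = 0; i < NUM_THREADS; i++)
>           pthread_join(t[i], NULL);
>       return 0;
>   }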
> 
> Again, since the codes I tested were small data in, small data out,
> compute-heavy and trivially parallel, running over multiple Phis is trivial
> and provides linear performance gains.  As we start porting more of our
> complex codes, I expect to see similar gains.  Our codes already run very,
> very well on 64 cores…
> 
> The Phis are separate cards, in separate PCIe slots.  I have not delved into
> the programming APIs fully, but I suspect you can use a single Phi card for
> your threaded codes.  The way I've been running is with a native Phi
> application (basically using the Phi as a separate Linux cluster node)… using
> it in offload mode is very different, and you may well be able to get your
> kernel running across both cards with the right pragmas.
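> 
> For what it's worth, the offload model looks roughly like the fragment below
> with the Intel compiler (illustrative only -- the function name, array and
> the work done on the card are made up):
> 
>   /* Illustrative offload fragment: ship buf to the card, run the loop
>    * on the Phi, and copy the results back. */
>   void square_on_phi(float *buf, long n)
>   {
>       #pragma offload target(mic:0) inout(buf : length(n))
>       for (long i = 0; i < n; i++)
>           buf[i] = buf[i] * buf[i];
>   }
> 
> whereas in native mode you just cross-compile the whole pthreads binary with
> -mmic and run it on the card directly.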
> 
> To be 100% honest, we took the boots-and-all approach.  If we had only
> purchased one Phi to test on, we would never have expended the energy to
> port all our codes.  Purchasing hundreds of them gives you a lot of impetus
> to port your codes quickly :)
> 
> 
> --
> Dr Stuart Midgley
> sdm...@sdm900.com
> 
> 
> 
> 
> On 13/02/2013, at 12:38 AM, Richard Walsh <rbwcn...@gmail.com> wrote:
> 
> >
> > Hey Stuart,
> >
> > Thanks for your answer ...
> >
> > That sounds compelling.  May I ask a few more questions?
> >
> > So should I assume that this was a threaded SMP-type application
> > (OpenMP, pthreads), or is it MPI based? Is the supporting multi-core
> > CPU of Sandy Bridge vintage? Have you been able to compare
> > the hyper-threaded, multi-core scaling on the Sandy Bridge side of the
> > system with that on the Phi (fewer cores to compare, of course)?  Using the
> > Intel compilers I assume ... how well do your kernels vectorize?  Curious
> > about the observed benefits of hyper-threading, which generally offers
> > little to floating-point-intensive HPC computations where functional-unit
> > collision is an issue.  You said you have 2 Phis per node.  Were you
> > running a single job across both?  Were the Phis in separate PCIe
> > slots or on the same card (sorry, I should know this, but I have just
> > started looking at Phi)?  If they are on separate cards in separate
> > slots, can I assume that I am limited to MPI-parallel implementations
> > when using both?
> >
> > Maybe that is more than a few questions ... ;-) ...
> >
> > Regards,
> 
> 

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
