Evening

Sorry for the slow response.

Most of our codes use pthreads; we have avoided MPI and OpenMP as much as 
possible.  Our current cluster consists of Nehalem, Westmere, Sandy Bridge and 
Interlagos nodes of various flavours.  Our Phi cards are in Sandy Bridge systems 
(the host machine has 16 cores and 128GB of RAM).  We run the Intel compilers.

Our fastest systems are the 64-core Interlagos systems (256GB RAM) running at 
2.6GHz.  For a few of our most important kernels, a single Phi had greater 
throughput than a whole node, which, if you count the flops, is expected.  The 
Phis have a massive amount of single-precision floating-point performance (our 
codes are single precision).

Our kernels vectorise very well (lots of hand-coded SSE3) and are expected to 
run very well on the Phi (we haven't tested these codes yet).  The codes we 
have tested are trivially parallel and very FP heavy - they ported easily to 
the Phi and run very well.

The codes I tested (in about 2 hours) saw linear speedup to 60 cores, then a 
"kink" in performance, then continued gains right up to 240 threads.  
Essentially these codes are single-CPU with a trivial wrapper around them to 
hand out work (see the sketch below).  This is exactly the sort of workload 
hyper-threading was designed to help.  So at 240 threads we were about 120 
times faster than a single thread of this code.  At 60 threads, we were 60 
times faster :)
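
For what it's worth, the "trivial wrapper" is nothing more sophisticated than 
this sort of thing - do_one_item() is just a stand-in for the unchanged 
single-CPU kernel, and the numbers are made up:

/* Sketch of the trivial pthreads wrapper: worker threads pull
 * independent work items off a shared counter.  NTHREADS is 240 when
 * running natively on the Phi. */
#include <pthread.h>

#define NTHREADS 240
#define NITEMS   100000

static long next_item = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void do_one_item(long item)  /* placeholder for the real kernel */
{
    (void)item;                     /* heavy single-precision FP work here */
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        long item = next_item++;
        pthread_mutex_unlock(&lock);
        if (item >= NITEMS)
            return NULL;
        do_one_item(item);
    }
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}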

Again, since the codes I tested were small data in, small data out, heavy 
compute and trivially parallel, running over multiple Phis is trivial and 
provides linear performance gains.  As we start porting more of our complex 
codes, I expect to see similar gains.  Our codes already run very very well on 
64 cores…

The Phis are separate cards, in separate PCIe slots.  I have not delved into 
the programming APIs fully, but I suspect you can utilise a single Phi card for 
your threaded codes.  The way I've been running is with a native Phi 
application (basically using the Phi as a separate Linux cluster node)… using 
it in offload mode is very different, and you may well be able to get your 
kernel running across both cards with the right pragmas.
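
I haven't run our codes this way myself, but roughly speaking the offload 
model looks something like the sketch below (saxpy_kernel, x, y and n are 
made-up names; check the Intel compiler docs for the exact clause syntax).  
mic:0 picks the first card and mic:1 the second, so one host process can 
drive both.  Native mode, by contrast, is just a recompile with icc -mmic 
and running the binary on the card over ssh.

/* Hedged sketch of Intel's offload pragmas; the kernel and array
 * names are made up for illustration. */
__attribute__((target(mic)))        /* build this function for the Phi too */
void saxpy_kernel(float a, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

void offload_example(float a, float *x, float *y, int n)
{
    /* mic:0 selects the first card; mic:1 would select the second. */
    #pragma offload target(mic:0) in(x:length(n)) inout(y:length(n))
    saxpy_kernel(a, x, y, n);
}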

To be 100% honest, we took the boots-and-all approach.  If we had only 
purchased one Phi to test on, we would never have expended the energy to port 
all our codes.  Purchasing hundreds of them gives you a lot of impetus to port 
your codes quickly :)


--
Dr Stuart Midgley
sdm...@sdm900.com




On 13/02/2013, at 12:38 AM, Richard Walsh <rbwcn...@gmail.com> wrote:

> 
> Hey Stuart,
> 
> Thanks for your answer ...
> 
> That sounds compelling.  May I ask a few more questions?
> 
> So should I assume that this was a threaded SMP type application
> (OpenMP, pthreads) or it is MPI based? Is the supporting CPU of the
> multi-core Sandy Bridge vintage? Have you been able to compare
> the hyper-threaded, multi-core scaling on that Sandy Bridge side of the
> system with that on the Phi (fewer cores to compare of course).  Using the
> Intel compilers I assume ... how well do your kernels vectorize?  Curious
> about the observed benefits of hyper-threading, which generally offers
> little to floating-point intensive HPC computations where functional unit
> collision is an issue.  You said you have 2 Phis per node.  Were you 
> running a single job across both?  Were the Phis in separate PCIE
> slots or on the same card (sorry I should know this, but I have just
> started looking at Phi).  If they are on separate cards in separate
> slots, can I assume that I am limited to MPI parallel implementations
> when using both?
> 
> Maybe that is more than a few questions ... ;-) ...
> 
> Regards,
