On 01/12/2013 04:25 PM, Stu Midgley wrote:
> Until the Phi's came along, we were purchasing 1RU, 4 socket nodes
> with 6276's and 256GB ram. On all our codes, we found the throughput
> to be greater than any equivalent density Sandy Bridge systems
> (usually 2 x dual socket in 1RU) at about 10-15% less energy and
> about 1/3 the price for the actual CPU (save a couple thousand $$ per
> 1RU).
For many workloads we found the same. The last few generations of AMD CPUs have had 4 memory channels per socket, and at first I was puzzled that even fairly memory-intensive codes scaled well. Even following a random pointer chain, performance almost doubled when I tested with 2 threads per memory channel instead of 1. Then I realized that the L3 latency is almost half the latency to main memory, so you get a significant throughput advantage from having a queue of L3 cache misses waiting for the instant any of the memory channels frees up. In fact, even with 2 jobs per memory channel the channel sometimes goes idle, and even 4 jobs per memory channel sees some increase. (A minimal sketch of the sort of pointer-chasing test I mean is appended at the end of this message.) The good news is that most codes aren't as memory bandwidth/latency intensive as the related micro-benchmarks, and therefore scale better. I think having more cores per memory channel is a key part of AMD's improved throughput per socket compared to Intel. Not always true of course; again, it's highly application dependent.

> Of course, we are now purchasing Phi's. First 2 racks meant to turn
> up this week.

Interesting, please report back on anything of interest that you find.
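In case anyone wants to poke at this on their own hardware, here is a minimal sketch of the sort of pointer-chasing test I mean (not the exact code I ran; the array size, step count and OpenMP build line are illustrative assumptions). Each thread follows its own dependent chain through an array much larger than L3, so throughput depends on how many misses the memory system can keep in flight at once:

/* chase.c: minimal pointer-chasing sketch (illustrative sizes, not the
 * original test).  Each thread follows its own dependent chain through a
 * single random cycle much larger than L3, so nearly every load misses
 * and throughput depends on how many misses can be in flight at once.
 * Build: gcc -O2 -fopenmp chase.c -o chase
 * Run:   ./chase <nthreads>                                              */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

#define N     (64u * 1024u * 1024u)   /* 64M 4-byte indices = 256 MB      */
#define STEPS 20000000u               /* dependent loads per thread       */

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
    if (nthreads < 1) nthreads = 1;

    uint32_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;

    /* Build one random cyclic permutation (Sattolo's algorithm) so each
     * load depends on the previous one and hardware prefetch can't help. */
    for (uint32_t i = 0; i < N; i++) next[i] = i;
    for (uint32_t i = N - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % i;          /* j in [0, i) */
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    uint64_t sink = 0;
    double t0 = omp_get_wtime();
#pragma omp parallel num_threads(nthreads) reduction(+:sink)
    {
        /* Start each thread at a different point on the same cycle. */
        uint32_t p = (uint32_t)(((uint64_t)omp_get_thread_num() * N) / nthreads);
        for (uint32_t s = 0; s < STEPS; s++)
            p = next[p];
        sink += p;           /* keep the chain from being optimized away */
    }
    double dt = omp_get_wtime() - t0;

    printf("%d threads: %.1f M dependent loads/s (sink %llu)\n",
           nthreads, (double)nthreads * STEPS / dt / 1e6,
           (unsigned long long)sink);
    free(next);
    return 0;
}

Comparing the loads/s figure at 1, 2 and 4 threads per memory channel (with threads pinned via numactl or GOMP_CPU_AFFINITY) is enough to see whether extra outstanding misses per channel still buy anything on a given box.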