help the winner win http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone
On 4 October 2012 21:05, Eugen Leitl <eu...@leitl.org> wrote:
>
> http://www.streamcomputing.eu/blog/2012-08-27/processors-that-can-do-20-gflops-watt/
>
> Processors that can do 20+ GFLOPS/Watt
> by Vincent Hindriksen, August 27, 2012
>
> For yearly power usage there is a rule of thumb which states that a device
> that is continuously on costs its wattage times 1.5 in euros per year. So
> the computer in front of me, which draws around 107 Watt, would cost me
> about €160 a year if I left it on. A moderate cluster with several GPUs of
> a few hundred Watts each would cost a few thousand euros a year. I would
> say: very doable for most companies.
>
> So why does performance per Watt matter? There is more to a Watt than just
> the cost. The energy needed to cool a cluster is considerable, as most of
> the energy escapes as heat. And then there is the growing demand for
> portable power. If you are thinking of swiping your credit card for a
> top-10 supercomputer, these energy costs become extremely high.
>
> In this article I try to get an overview of who is entering the 20+
> GFLOPS/Watt area. All processors that do less than 20 GFLOPS/Watt need to
> have other advantages to survive. And you'll see that all the green
> processors are programmed with OpenCL, the technology StreamComputing is
> all about.
>
> The list
>
> Understand that since I mix CPUs, GPUs and SoCs (= CPU+GPU), the list is
> really only an indication of what is possible. Also, a computer is built
> from more energy-consuming parts than just the processors: interconnects,
> memory, hard drives, etc.
>
> Disclaimer: the list below is incomplete and based on theoretical values.
> The TDP is assumed to be consumed when the processor is working at maximum
> performance. Actual GFLOPS/Watt values can be much lower, depending on
> many factors. If you want to buy hardware specifically for the highest
> GFLOPS/Watt, have your software tested on the device.
>
> Processor                          Type              GFLOPS     GFLOPS     Watt   GFLOPS/W  GFLOPS/W
>                                                      (32-bit)   (64-bit)   (TDP)  (32-bit)  (64-bit)
> Adapteva Epiphany-IV               Epiphany          100        N/A        2      50        N/A
> Movidius Myriad                    SoC: LEON3+SHAVE  15.28      N/A        0.32   48        N/A
> ZiiLabs                            ARM SoC           58         N/A        ?      20?       N/A
> Nvidia Tesla K10                   X86 GPU           4577       190        225    20.34     ?
> ARM + NEON T604                    ARM SoC           8 + 68     N/A        4?     19?       N/A
> Nvidia GTX 690                     X86 GPU x 2       5621       234?       300    18.74     0.78
> GeForce GTX 680                    X86 GPU           3090       128        195    15.85     0.65
> AMD Radeon HD 7970 GHz             X86 GPU           4300       1075       300+   14.3      3.58
> Intel Knight's Corner (Xeon Phi)   X87?              2000?      1000       200?   10?       5?
> AMD A10-5800K + HD 7660D           X86 SoC           121 + 614  ?          100    7.35      ?
> Intel Core i7-3770 + HD 4000       X86 SoC           225+294.4  112+73.6   77     6.74      2.41
> IBM Power A2                       Power CPU         204?       204        55     3.72?     3.72
> Intel Core i7-3770                 X86 CPU           225        112        ?      ?         ?
> AMD A10-5800K                      X86 CPU           121        60?        ?      ?         ?
>
> The list contains recent and generally available processors, but I will
> add any processor you want to see in the list – just request them in a
> comment.
>
> Please also point me to sources where official data on these processors
> can be found, as it seems to be top-secret data. As not all the data was
> available, I had to make some guesses.
>
> CPU vs GPU
>
> Let's be clear:
>
> - A GPU needs a CPU as a host.
> - A GPU is great at vector computations; a CPU is much better at scalar
>   computations.
>
> In other words, a mix of a scalar and a vector processor is best. But once
> a problem can be defined as a vector problem, the GPU is much, much faster
> than a CPU.
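
Just to make the arithmetic in the list explicit, here is a small C sketch
using only the numbers quoted above (the Tesla K10's 4577 GFLOPS and 225 W
TDP, and the Watt-times-1.5 rule); the 0.17 euro/kWh figure is my own guess
at where the factor 1.5 comes from, since 8760 hours a year times roughly
0.17 euro/kWh is about 1.5 euro per Watt-year:

/* gflops_per_watt.c - the arithmetic behind the table and the cost rule of
 * thumb, using the Tesla K10 figures quoted above as an example. */
#include <stdio.h>

int main(void)
{
    double peak_gflops = 4577.0;  /* single-precision peak from the table */
    double tdp_watt    = 225.0;   /* board TDP from the table */

    /* Efficiency as listed: theoretical peak divided by TDP. */
    double gflops_per_watt = peak_gflops / tdp_watt;

    /* Rule of thumb: wattage * 1.5 = euros per year when running 24/7,
     * assuming 8760 hours/year at ~0.17 euro/kWh (my assumption). */
    double euro_per_year = tdp_watt * 8760.0 / 1000.0 * 0.17;

    printf("%.2f GFLOPS/Watt, ~%.0f euro/year at full load\n",
           gflops_per_watt, euro_per_year);
    return 0;
}

With the same factor, the 107 W desktop mentioned at the top of the article
comes out at the quoted €160 a year.
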
> 64 bit vs 32 bit
>
> Memory access costs energy, and doubling the data width means that only
> half as many values arrive at the processor per transfer, so there are two
> reasons why 64-bit computation consumes more energy. Due to architecture
> differences, CPUs pay a penalty for 32 bit and GPUs a penalty for 64 bit.
>
> Notice that most X86 alternatives have no 64-bit support, or only recently
> started with it. GPUs crunch double-precision numbers at a quarter or less
> of the 32-bit performance roof.
>
> Architectures
>
> ARM, X86/X87, Power and Epiphany all make different architecture choices
> to reach their targeted trade-off between precision, power consumption and
> performance optimisation (control unit). These choices sometimes make it
> impossible to keep pace with other architectures in a certain direction.
>
> Current winner: Adapteva Epiphany
>
> Their 64-core Epiphany-IV is programmable with OpenCL, and at 50
> GFLOPS/Watt it is worth putting time into porting software if you need a
> portable device. People who have already ported their software to OpenCL
> have an advantage here. Adapteva even claims 72 GFLOPS/Watt, as you can
> read here. With a 100-core CPU coming up, they will probably raise the bar
> even further.
>
> X86 CPUs have the advantage of precision and legacy code, of which
> precision is the bigger advantage. As X86 GPUs (with Nvidia on top) are
> entering the 20+ GFLOPS/Watt range, this could be very interesting for
> defending the X86 market against ARM.
>
> ARM processors have a lot of software written for them (via Android) and
> are very flexible in design, while keeping power usage for the CPU part at
> around 1 Watt. For instance, ZiiLabs' processor can be compared to
> Adapteva's design, but with an ARM CPU attached to it.
>
> Conclusion
>
> There is much more to a processor than this one GFLOPS/Watt number, and
> one can only speculate about which architecture will be mainstream in a
> few years. Luckily, recompiling for other architectures is getting easier
> with compiler technologies such as LLVM, so we don't need to worry too
> much, except about redesigning our software for multi-core, of course.
> You have read above that the new architectures are programmed with OpenCL;
> it is better to invest in this technology now than later.
>
> More reading
>
> As memory access takes energy, minimising memory calls can lower
> consumption. This article on the ARM blog explains how this is done with
> MALI GPUs.
>
> The Mont-Blanc project is a supercomputer based on ARM. This 12-page PDF
> shows some numbers and specifications of this supercomputer.
>
> As supercomputers eat lots of power, The Green 500 tries to encourage
> building greener HPC.
>
> Related content:
>
> - Power to the Vector Processor
> - AMD's answer to NVIDIA TESLA K10: the FirePro S9000
> - Let's enter the Top500 HPC list using GPUs
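
Since the article keeps coming back to OpenCL, here is roughly what the
entry cost looks like: a minimal vector-add, host code plus kernel, in
plain C. This is only a sketch I typed up, not anything from the article
(it assumes an OpenCL 1.1 SDK, error checking is stripped, build with
something like "gcc vecadd.c -lOpenCL"); the same source should run on the
GPUs in the table above and, if the article is right about their toolchain,
on Epiphany as well.

/* vecadd.c - minimal OpenCL host + kernel sketch (C, OpenCL 1.1 API).
 * Error checking is stripped to keep it short; a real port checks every
 * cl_int return code. */
#include <stdio.h>
#include <CL/cl.h>

/* The kernel: this part runs on the device and is what the GPU is good at,
 * one work-item per element, pure vector work. */
static const char *src =
    "__kernel void vecadd(__global const float *a,\n"
    "                     __global const float *b,\n"
    "                     __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Boilerplate: take the first platform/device the runtime offers. */
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Copy the inputs to the device and make room for the result. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    /* Kernel source is compiled at run time for whatever device we got. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    /* Launch N work-items and read the result back. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    printf("c[10] = %.1f (expected 30.0)\n", c[10]);

    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}

The host boilerplate dominates, but it is written once; the per-device work
lives almost entirely in the kernel string.
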
> Comments
>
> david moloney:
>   If you were at HotChips 2011 you would have seen the Movidius Myriad,
>   which delivers 50 GFLOPS/W.
>
> streamcomputing:
>   Ok, added to the list – I will update the text later. How many GFLOPS
>   does it deliver?
>
> david moloney:
>   Also, Epiphany is shown as ARM-based in your table, which I'm sure must
>   be a mistake.
>
> streamcomputing:
>   Oops, that's not ARM at all! Thanks for noticing!
>
> pip010:
>   64 RISC units, but it doesn't say of what kind. Good chance it is ARM!
>
> MySchizoBuddy:
>   Can you include Tilera chips in the list? http://www.tilera.com/
>   I don't know how many GFLOPS they deliver.
>
> streamcomputing:
>   True, not mentioned. I only found
>   http://www.tgdaily.com/business-and-law-features/39408-tilera-goes-pro-with-tilepro64
>   from 2007; they do not provide any information on actual
>   performance/Watt anywhere, or it is very well hidden.
>
> PENG ZHAO:
>   How about the Nvidia Tesla K10 and GeForce GTX 690? I found some figures.
>   Tesla K10:
>     Power: 225 W
>     Single float: 4577 Gigaflops, 20.342 GFlops/W
>     Double float: 190 Gigaflops, 0.8444 GFlops/W
>   GTX 690:
>     Power: 300 W
>     Single float: 5621 Gigaflops, 18.74 GFlops/W
>     Double float: ?
>   The single-float computation power is impressive, but the double-float
>   one is rubbish. Even worse, Nvidia seems to have stopped updating their
>   OpenCL implementation.
>
> streamcomputing:
>   The GTX 690 is a double GPU, so I chose to put the 680 in the list;
>   maybe a good point to add double-GPU cards too. It seems that my source
>   for the K20 was completely wrong. I'll update for the K10 for now.
>
> E P:
>   The table says FLOPS/Watt instead of GFLOPS/Watt.
>   Can you please include integer arithmetic? Depending on floating point
>   (especially 32-bit) is sometimes not an option due to accumulation of
>   errors, so a lot of integer-arithmetic algorithms have been developed.
>   The main point in porting them to OpenCL will be keeping the integer
>   calculations. That becomes even more important given that a lot of
>   devices increase performance and/or decrease power consumption at the
>   expense of accuracy.
>
> streamcomputing:
>   It was extremely difficult (and exhausting) to find the data already in
>   the list. I will therefore focus on what is already there and try to
>   complete the list for just 32-bit and 64-bit (be it floats or integers).
>   The trade-off between precision and the other aspects of computing is an
>   interesting subject, though.
>
> Pingback: Processors that can do 20 GFLOPS/Watt | Adapteva
>
> rahul garg (http://twitter.com/codedivine):
>   Corrections:
>   1. The 3770K's peak (CPU-only) is about 225 GFlops (at base frequency,
>      slightly higher with turbo).
>   2. Knight's Corner has fp32 at twice the rate of fp64, so I expect 2
>      teraflops fp32 for Knight's Corner.
>   3. The 3770K CPU-only fp64 peak is half of the fp32 peak = 112 GFlops.
>
> rahul garg (http://twitter.com/codedivine):
>   Another correction: the HD 4000 on the 3770K has an fp64 peak of 73.6
>   GFlops.
>
> streamcomputing:
>   Thanks for all the feedback! You're great! Together we can complete the
>   picture.
>
> Stuart (http://twitter.com/daphreak):
>   http://www.kalray.eu/en/technology/mppa-256.html seems to be a good
>   performer, but no OpenCL support.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf