help the winner win http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone
On 4 October 2012 21:05, Eugen Leitl <eu...@leitl.org> wrote:
>
> http://www.streamcomputing.eu/blog/2012-08-27/processors-that-can-do-20-gflops-watt/
>
> Processors that can do 20+ GFLOPS/Watt
> by Vincent Hindriksen, August 27, 2012
>
> For yearly power usage there is a rule of thumb which states that a device
> that is continuously on costs its wattage times 1.5 in euros per year. So
> the computer in front of me, which draws around 107 Watt, would cost me
> about €160 a year if I left it on. A moderate cluster with several GPUs of
> a few hundred Watts each would cost a few thousand euros a year. I would
> say: very doable for most companies.
>
> So why does performance per Watt matter? There is more to a Watt than just
> the cost. The energy needed to cool a cluster is considerable, as most of
> the energy escapes as heat. And then there is the growing demand for
> portable power. If you are thinking of swiping your credit card for a
> top-10 supercomputer, these energy costs become extremely high.
>
> In this article I try to get an overview of who is entering the 20+
> GFLOPS/Watt area. All processors that do less than 20 GFLOPS/Watt need to
> have other advantages to survive. And you'll see that all the green
> processors are programmed with OpenCL, the technology StreamComputing is
> all about.
>
> The list
>
> Understand that since I mix CPUs, GPUs and SoCs (= CPU+GPU), the list is
> really only an indication of what is possible. Also, a computer is built
> from more energy-consuming parts than just the processors: interconnects,
> memory, hard drives, etc.
>
> Disclaimer: the list below is incomplete and based on theoretical values.
> The TDP is assumed to be consumed when the processor is working at maximum
> performance. Actual GFLOPS/Watt values can be much lower, depending on
> many factors. If you want to buy hardware specifically for the highest
> GFLOPS/Watt, have your software tested on the device.
>
> Processor                          Type              GFLOPS     GFLOPS     Watt   GFLOPS/W  GFLOPS/W
>                                                      (32-bit)   (64-bit)   (TDP)  (32-bit)  (64-bit)
> Adapteva Epiphany-IV               Epiphany          100        N/A        2      50        N/A
> Movidius Myriad                    SoC: LEON3+SHAVE  15.28      N/A        0.32   48        N/A
> ZiiLabs                            ARM SoC           58         N/A        ?      20?       N/A
> Nvidia Tesla K10                   X86 GPU           4577       190        225    20.34     ?
> ARM + NEON T604                    ARM SoC           8 + 68     N/A        4?     19?       N/A
> Nvidia GTX 690                     X86 GPU x 2       5621       234?       300    18.74     0.78
> GeForce GTX 680                    X86 GPU           3090       128        195    15.85     0.65
> AMD Radeon HD 7970 GHz             X86 GPU           4300       1075       300+   14.3      3.58
> Intel Knight's Corner (Xeon Phi)   X87?              2000?      1000       200?   10?       5?
> AMD A10-5800K + HD 7660D           X86 SoC           121 + 614  ?          100    7.35      ?
> Intel Core i7-3770 + HD 4000       X86 SoC           225+294.4  112+73.6   77     6.74      2.41
> IBM Power A2                       Power CPU         204?       204        55     3.72?     3.72
> Intel Core i7-3770                 X86 CPU           225        112        ?      ?         ?
> AMD A10-5800K                      X86 CPU           121        60?        ?      ?         ?
>
> The list contains recent and generally available processors, but I will
> add any processor you want to see in the list – just request them in a
> comment.
>
> Please also point me to sources where official data on these processors
> can be found, as it seems to be top-secret data. As not all the data was
> available, I had to make some guesses.
>
> CPU vs GPU
>
> Let's be clear:
>
> - A GPU needs a CPU as a host.
> - A GPU is great at vector computations; a CPU is much better at scalar
>   computations.
>
> In other words, a mix of a scalar and a vector processor is best. But once
> a problem can be defined as a vector problem, the GPU is much, much faster
> than a CPU.
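
Just to make the arithmetic in the list explicit, here is a small C sketch
using only the numbers quoted above (the Tesla K10's 4577 GFLOPS and 225 W
TDP, and the Watt-times-1.5 rule); the 0.17 euro/kWh figure is my own guess
at where the factor 1.5 comes from, since 8760 hours a year times roughly
0.17 euro/kWh is about 1.5 euro per Watt-year:

/* gflops_per_watt.c - the arithmetic behind the table and the cost rule of
 * thumb, using the Tesla K10 figures quoted above as an example. */
#include <stdio.h>

int main(void)
{
    double peak_gflops = 4577.0;  /* single-precision peak from the table */
    double tdp_watt    = 225.0;   /* board TDP from the table */

    /* Efficiency as listed: theoretical peak divided by TDP. */
    double gflops_per_watt = peak_gflops / tdp_watt;

    /* Rule of thumb: wattage * 1.5 = euros per year when running 24/7,
     * assuming 8760 hours/year at ~0.17 euro/kWh (my assumption). */
    double euro_per_year = tdp_watt * 8760.0 / 1000.0 * 0.17;

    printf("%.2f GFLOPS/Watt, ~%.0f euro/year at full load\n",
           gflops_per_watt, euro_per_year);
    return 0;
}

With the same factor, the 107 W desktop mentioned at the top of the article
comes out at the quoted €160 a year.
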
> 64 bit vs 32 bit
>
> Memory access costs energy, and doubling the data width means that only
> half as many values arrive at the processor per transfer, so there are two
> reasons why 64-bit computation consumes more energy. Due to architecture
> differences, CPUs pay a penalty for 32 bit and GPUs a penalty for 64 bit.
>
> Notice that most X86 alternatives have no 64-bit support, or only recently
> started with it. GPUs crunch double-precision numbers at a quarter or less
> of the 32-bit performance roof.
>
> Architectures
>
> ARM, X86/X87, Power and Epiphany all make different architecture choices
> to reach their targeted trade-off between precision, power consumption and
> performance optimisation (control unit). These choices sometimes make it
> impossible to keep pace with other architectures in a certain direction.
>
> Current winner: Adapteva Epiphany
>
> Their 64-core Epiphany-IV is programmable with OpenCL, and at 50
> GFLOPS/Watt it is worth putting time into porting software if you need a
> portable device. People who have already ported their software to OpenCL
> have an advantage here. Adapteva even claims 72 GFLOPS/Watt, as you can
> read here. With a 100-core CPU coming up, they will probably raise the bar
> even further.
>
> X86 CPUs have the advantage of precision and legacy code, of which
> precision is the bigger advantage. As X86 GPUs (with Nvidia on top) are
> entering the 20+ GFLOPS/Watt range, this could be very interesting for
> defending the X86 market against ARM.
>
> ARM processors have a lot of software written for them (via Android) and
> are very flexible in design, while keeping power usage for the CPU part at
> around 1 Watt. For instance, ZiiLabs' processor can be compared to
> Adapteva's design, but with an ARM CPU attached to it.
>
> Conclusion
>
> There is much more to a processor than this one GFLOPS/Watt number, and
> one can only speculate about which architecture will be mainstream in a
> few years. Luckily, recompiling for other architectures is getting easier
> with compiler technologies such as LLVM, so we don't need to worry too
> much, except about redesigning our software for multi-core, of course.
> You have read above that the new architectures are programmed with OpenCL;
> it is better to invest in this technology now than later.
>
> More reading
>
> As memory access takes energy, minimising memory calls can lower
> consumption. This article on the ARM blog explains how this is done with
> MALI GPUs.
>
> The Mont-Blanc project is a supercomputer based on ARM. This 12-page PDF
> shows some numbers and specifications of this supercomputer.
>
> As supercomputers eat lots of power, The Green 500 tries to encourage
> building greener HPC.
>
> Related content:
>
> - Power to the Vector Processor
> - AMD's answer to NVIDIA TESLA K10: the FirePro S9000
> - Let's enter the Top500 HPC list using GPUs
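
Since the article keeps coming back to OpenCL, here is roughly what the
entry cost looks like: a minimal vector-add, host code plus kernel, in
plain C. This is only a sketch I typed up, not anything from the article
(it assumes an OpenCL 1.1 SDK, error checking is stripped, build with
something like "gcc vecadd.c -lOpenCL"); the same source should run on the
GPUs in the table above and, if the article is right about their toolchain,
on Epiphany as well.

/* vecadd.c - minimal OpenCL host + kernel sketch (C, OpenCL 1.1 API).
 * Error checking is stripped to keep it short; a real port checks every
 * cl_int return code. */
#include <stdio.h>
#include <CL/cl.h>

/* The kernel: this part runs on the device and is what the GPU is good at,
 * one work-item per element, pure vector work. */
static const char *src =
    "__kernel void vecadd(__global const float *a,\n"
    "                     __global const float *b,\n"
    "                     __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Boilerplate: take the first platform/device the runtime offers. */
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Copy the inputs to the device and make room for the result. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    /* Kernel source is compiled at run time for whatever device we got. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    /* Launch N work-items and read the result back. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    printf("c[10] = %.1f (expected 30.0)\n", c[10]);

    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}

The host boilerplate dominates, but it is written once; the per-device work
lives almost entirely in the kernel string.
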
> Comments
>
> david moloney:
>   If you were at HotChips 2011 you would have seen the Movidius Myriad,
>   which delivers 50 GFLOPS/W.
>
> streamcomputing:
>   Ok, added to the list – I will update the text later. How many GFLOPS
>   does it deliver?
>
> david moloney:
>   Also, Epiphany is shown as ARM-based in your table, which I'm sure must
>   be a mistake.
>
> streamcomputing:
>   Oops, that's not ARM at all! Thanks for noticing!
>
> pip010:
>   64 RISC units, but it doesn't say of what kind. Good chance it is ARM!
>
> MySchizoBuddy:
>   Can you include Tilera chips in the list? http://www.tilera.com/
>   I don't know how many GFLOPS they deliver.
>
> streamcomputing:
>   True, not mentioned. I only found
>   http://www.tgdaily.com/business-and-law-features/39408-tilera-goes-pro-with-tilepro64
>   from 2007; they do not provide any information on actual
>   performance/Watt anywhere, or it is very well hidden.
>
> PENG ZHAO:
>   How about the Nvidia Tesla K10 and GeForce GTX 690? I found some figures.
>   Tesla K10:
>     Power: 225 W
>     Single float: 4577 Gigaflops, 20.342 GFlops/W
>     Double float: 190 Gigaflops, 0.8444 GFlops/W
>   GTX 690:
>     Power: 300 W
>     Single float: 5621 Gigaflops, 18.74 GFlops/W
>     Double float: ?
>   The single-float computation power is impressive, but the double-float
>   one is rubbish. Even worse, Nvidia seems to have stopped updating their
>   OpenCL implementation.
>
> streamcomputing:
>   The GTX 690 is a double GPU, so I chose to put the 680 in the list;
>   maybe a good point to add double-GPU cards too. It seems that my source
>   for the K20 was completely wrong. I'll update for the K10 for now.
>
> E P:
>   The table says FLOPS/Watt instead of GFLOPS/Watt.
>   Can you please include integer arithmetic? Depending on floating point
>   (especially 32-bit) is sometimes not an option due to accumulation of
>   errors, so a lot of integer-arithmetic algorithms have been developed.
>   The main point in porting them to OpenCL will be keeping the integer
>   calculations. That becomes even more important given that a lot of
>   devices increase performance and/or decrease power consumption at the
>   expense of accuracy.
>
> streamcomputing:
>   It was extremely difficult (and exhausting) to find the data already in
>   the list. I will therefore focus on what is already there and try to
>   complete the list for just 32-bit and 64-bit (be it floats or integers).
>   The trade-off between precision and the other aspects of computing is an
>   interesting subject, though.
>
> Pingback: Processors that can do 20 GFLOPS/Watt | Adapteva
>
> rahul garg (http://twitter.com/codedivine):
>   Corrections:
>   1. The 3770K's peak (CPU-only) is about 225 GFlops (at base frequency,
>      slightly higher with turbo).
>   2. Knight's Corner has fp32 at twice the rate of fp64, so I expect 2
>      teraflops fp32 for Knight's Corner.
>   3. The 3770K CPU-only fp64 peak is half of the fp32 peak = 112 GFlops.
>
> rahul garg (http://twitter.com/codedivine):
>   Another correction: the HD 4000 on the 3770K has an fp64 peak of 73.6
>   GFlops.
>
> streamcomputing:
>   Thanks for all the feedback! You're great! Together we can complete the
>   picture.
>
> Stuart (http://twitter.com/daphreak):
>   http://www.kalray.eu/en/technology/mppa-256.html seems to be a good
>   performer, but no OpenCL support.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf