On this question, see: http://arstechnica.com/news.ars/post/20080430-ps3s-cell-cpu-tops-high-performance-computing-benchmark.html

They obtained at most 30% of peak performance on x86 processors, while on Cell and Niagara 2 they obtained about 60% of peak. It seems that for memory-intensive codes the processor must have massive memory bandwidth to get anywhere near peak performance.
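To see roughly why, here is a back-of-the-envelope roofline-style sketch: attainable FLOP/s is about min(peak, memory bandwidth x flops per byte). The peak and bandwidth figures below are ballpark assumptions for a 2008-era dual-socket Xeon node and a PowerXCell 8i blade, not measured values, and the arithmetic intensity is just a plausible guess for a bandwidth-hungry kernel.

#include <stdio.h>

/* Attainable rate for a streaming kernel: limited either by the peak
 * arithmetic rate or by how fast operands can be fetched from memory. */
static double attainable_gflops(double peak_gflops, double bw_gbytes_s,
                                double flops_per_byte)
{
    double mem_bound = bw_gbytes_s * flops_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    double ai = 2.0;  /* assumed flops per byte for a memory-hungry kernel */

    /* Assumed: dual-socket 3 GHz quad-core Xeon, ~96 GFLOP/s peak DP,
     * ~10 GB/s sustained memory bandwidth over the FSB.                 */
    double xeon = attainable_gflops(96.0, 10.0, ai);

    /* Assumed: PowerXCell 8i, ~102 GFLOP/s peak DP, ~25 GB/s XDR memory. */
    double cell = attainable_gflops(102.4, 25.0, ai);

    printf("Xeon node: %.0f GFLOP/s (%.0f%% of peak)\n", xeon, 100.0 * xeon / 96.0);
    printf("Cell     : %.0f GFLOP/s (%.0f%% of peak)\n", cell, 100.0 * cell / 102.4);
    return 0;
}

With the same arithmetic intensity, the machine with more bandwidth per peak flop simply sustains a larger fraction of its peak, which is roughly the pattern the HPCC numbers in the article show.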
2008/8/29 Mikhail Kuzminsky <[EMAIL PROTECTED]>:
> In message from "Li, Bo" <[EMAIL PROTECTED]> (Fri, 29 Aug 2008 08:15:42 +0800):
>> Yes, Firestream has great paper performance, but how do you actually get it?
>> As for costs, if you don't mind using non-professional components, you can
>> try their gaming cards, which are much cheaper. We bought NVidia's last
>> flagship card, the 8800 Ultra, for 600 Euro, which was a crazy price, and now
>> you can buy two GTX280s for less. If you can live with SP, you get 936 GFLOPS
>> from each. And we have achieved 40% of their peak performance, which sounds
>> good.
>
> But what percentage of peak can you get on an x86 CPU?
> If it's something like sgemm, then it doesn't look too attractive to me :-( :
> on an ordinary x86 I can obtain about 90% of peak performance, and the DP
> performance difference between Xeon/Opteron CPUs and a GPU is not that high :-(
>
> Mikhail
>
>> Regards,
>> Li, Bo
>> ----- Original Message -----
>> From: "Mikhail Kuzminsky" <[EMAIL PROTECTED]>
>> To: "Li, Bo" <[EMAIL PROTECTED]>
>> Cc: "Vincent Diepeveen" <[EMAIL PROTECTED]>; <beowulf@beowulf.org>
>> Sent: Friday, August 29, 2008 1:52 AM
>> Subject: Re: [Beowulf] gpgpu
>>
>>> In message from "Li, Bo" <[EMAIL PROTECTED]> (Thu, 28 Aug 2008 14:20:15 +0800):
>>>> ...
>>>> Currently, the DP performance of GPUs is not as good as we expected: only
>>>> 1/8 to 1/10 of the SP flops. That is also a problem.
>>>
>>> AMD data: Firestream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W for DP,
>>> i.e. DP is 5 times slower than SP.
>>>
>>> The Firestream 9250 does 1 TFLOPS SP, so 1/5 of that is about 200 GFLOPS DP.
>>> The price will be, I suppose, about $2000 - as for the 9170.
>>>
>>> Now look at a modern dual-socket quad-core Beowulf node priced around
>>> $4000+, for example. For the Opteron 2350/2 GHz chips I use, peak DP
>>> performance is 64 GFLOPS (8 cores). For 3 GHz Xeon chips it is about
>>> 100 GFLOPS.
>>>
>>> So GPGPU peak DP performance is only 1.5-2 times higher than the CPUs'. Is
>>> that enough for an essential speedup, taking into account the time needed
>>> for data transfers to/from the GPU?
>>>
>>>> I would suggest hybrid computation platforms, with GPUs, CPUs, and
>>>> processors like ClearSpeed. It may be a good topic for a programming model.
>>>
>>> ClearSpeed, if there is no new hardware now, does not have enough DP
>>> performance compared with typical modern servers on quad-core CPUs.
>>>
>>> Yours
>>> Mikhail
>
>>>> Regards,
>>>> Li, Bo
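As an aside on the peak figures quoted above: the 64 and ~100 GFLOPS come straight from sockets x cores x clock x DP flops per cycle. A quick sketch, assuming 4 DP flops per core per cycle (one 128-bit SSE add plus one 128-bit multiply, which holds for both Barcelona Opterons and Harpertown Xeons):

#include <stdio.h>

/* Peak DP FLOP/s = sockets * cores per socket * clock (GHz)
 *                  * DP flops per core per cycle.            */
static double peak_dp_gflops(int sockets, int cores, double ghz, int flops_per_cycle)
{
    return sockets * cores * ghz * flops_per_cycle;
}

int main(void)
{
    printf("2x Opteron 2350, 2.0 GHz: %.0f GFLOPS\n", peak_dp_gflops(2, 4, 2.0, 4)); /* 64 */
    printf("2x quad-core Xeon, 3 GHz: %.0f GFLOPS\n", peak_dp_gflops(2, 4, 3.0, 4)); /* 96 */
    /* Against the ~200 GFLOPS DP estimated above for a Firestream 9250,
     * that is the 1.5-2x gap Mikhail mentions - before counting PCIe transfers. */
    return 0;
}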
>>>> ----- Original Message -----
>>>> From: "Vincent Diepeveen" <[EMAIL PROTECTED]>
>>>> To: "Li, Bo" <[EMAIL PROTECTED]>
>>>> Cc: "Mikhail Kuzminsky" <[EMAIL PROTECTED]>; "Beowulf" <beowulf@beowulf.org>
>>>> Sent: Thursday, August 28, 2008 12:22 AM
>>>> Subject: Re: [Beowulf] gpgpu
>>>>
>>>>> Hi Bo,
>>>>>
>>>>> Thanks for your message.
>>>>>
>>>>> What library do I call to find primes?
>>>>>
>>>>> Currently it's searching here for primes (PRPs) of the form
>>>>> p = (2^n + 1) / 3,
>>>>> where n is roughly 1.5 million bits as we speak.
>>>>>
>>>>> For SSE2-type processors there is George Woltman's assembler code (MiT)
>>>>> to do the squaring plus the implicit modulo; how do you plan to beat that
>>>>> kind of really optimized number crunching on a GPU?
>>>>>
>>>>> You'll have to figure out a way to find an instruction-level parallelism
>>>>> of at least 32, which also doesn't write to the same cache line, I
>>>>> *guess* (there is no documentation to verify that, in fact).
>>>>>
>>>>> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes.
>>>>>
>>>>> In fact the first problem to solve is to do some sort of squaring really
>>>>> quickly.
>>>>>
>>>>> If you figure that out on a PC, experience shows you are still losing a
>>>>> potential factor of 8 to another zillion optimizations.
>>>>>
>>>>> You're not allowed to lose a factor of 8. The 52 GFLOPS a GPU can deliver
>>>>> on paper @ 250 watt TDP (you bet it will consume that when you make it
>>>>> work that hard) means the GPU effectively delivers less than 7 GFLOPS
>>>>> double precision, thanks to inefficient code. That's a lot less than a
>>>>> quad-core delivers, with a TDP far under 100 watt.
>>>>>
>>>>> Additionally, remember the P4. On paper, the claim at its release was
>>>>> that it could execute 4 integer instructions per cycle; in reality it was
>>>>> a processor with an IPC far under 1 for most integer codes. All kinds of
>>>>> stuff ran badly on it.
>>>>>
>>>>> Experience shows the same holds for today's GPUs: the scientists who have
>>>>> run codes on them so far, and who are really experienced CUDA
>>>>> programmers, have found that the speed they deliver is a very big bummer.
>>>>>
>>>>> Additionally, 250 watt TDP for massive number crunching is too much.
>>>>> It's well over a factor of 2 of the power consumption of a quad-core. I
>>>>> can soon take a look in China myself at what power prices are over there,
>>>>> but I can assure you they will rise soon.
>>>>>
>>>>> Note that I explicitly mention the n's I'm searching here because they
>>>>> should fit within the caches. So the very secret bandwidth you can
>>>>> achieve in practice (as we know, Nvidia lobotomized the bandwidth in the
>>>>> GPU cards; only the Tesla type seems not to be lobotomized) - I'm not
>>>>> even teasing you with that.
>>>>>
>>>>> This is true for any type of code. You lose it in the details. Only
>>>>> custom-tailored solutions will work, simply because they're factors
>>>>> faster.
>>>>>
>>>>> Thanks,
>>>>> Vincent
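To put Vincent's efficiency argument in numbers, here is the same arithmetic as a few lines of code. The quad-core peak, efficiency, and TDP figures are assumed ballpark values for a 2008-era part, not vendor numbers; the GPU figures are the ones Vincent gives above.

#include <stdio.h>

int main(void)
{
    /* Vincent's GPU estimate: 52 GFLOPS DP on paper, a factor of 8 lost to
     * code that can't be tuned as tightly as Woltman's SSE2 assembler, 250 W. */
    double gpu_effective = 52.0 / 8.0;               /* ~6.5 GFLOPS */
    double gpu_per_watt  = gpu_effective / 250.0;

    /* Assumed quad-core CPU: ~40 GFLOPS peak DP, ~90% achievable on tuned
     * code (as Mikhail reports for sgemm), ~95 W TDP.                       */
    double cpu_effective = 40.0 * 0.9;               /* ~36 GFLOPS */
    double cpu_per_watt  = cpu_effective / 95.0;

    printf("GPU: %.1f GFLOPS effective, %.3f GFLOPS/W\n", gpu_effective, gpu_per_watt);
    printf("CPU: %.1f GFLOPS effective, %.3f GFLOPS/W\n", cpu_effective, cpu_per_watt);
    return 0;
}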
>>>>> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>>>>>
>>>>>> Hello,
>>>>>> IMHO, it is better to call BLAS or a similar library rather than
>>>>>> programming your own functions. And CUDA treats the GPU as a cluster,
>>>>>> so .CU code does not work like our normal code. If you have a lot of
>>>>>> matrix or vector computation, it is better to use Brook+/CAL, which can
>>>>>> show the great power of the AMD GPUs.
>>>>>> Regards,
>>>>>> Li, Bo
>>>>>> ----- Original Message -----
>>>>>> From: "Mikhail Kuzminsky" <[EMAIL PROTECTED]>
>>>>>> To: "Vincent Diepeveen" <[EMAIL PROTECTED]>
>>>>>> Cc: "Beowulf" <beowulf@beowulf.org>
>>>>>> Sent: Wednesday, August 27, 2008 2:35 AM
>>>>>> Subject: Re: [Beowulf] gpgpu
>>>>>>
>>>>>>> In message from Vincent Diepeveen <[EMAIL PROTECTED]> (Tue, 26 Aug 2008
>>>>>>> 00:30:30 +0200):
>>>>>>>> Hi Mikhail,
>>>>>>>>
>>>>>>>> I'd say they're OK for black-box 32-bit calculations that can make do
>>>>>>>> with a GB or 2 of RAM; other than that they're just luxurious electric
>>>>>>>> heating.
>>>>>>>
>>>>>>> I also want a simple black box, but a 64-bit one (Tesla C1060 or
>>>>>>> Firestream 9170 or 9250). Unfortunately life isn't restricted to
>>>>>>> BLAS/LAPACK/FFT :-)
>>>>>>>
>>>>>>> So I'll need to program something else. People say that the best choice
>>>>>>> is CUDA for Nvidia. When I look at the sgemm source, it has about a
>>>>>>> thousand lines (or more) in the *.cu files. Therefore I think that a
>>>>>>> somewhat more difficult algorithm, such as a special matrix
>>>>>>> diagonalization, will require a lot of programming work :-(.
>>>>>>>
>>>>>>> It's interesting that when I read the Firestream Brook+ "kernel
>>>>>>> function" source example - for the addition of 2 vectors ("Building a
>>>>>>> High Level Language Compiler For GPGPU",
>>>>>>> Bixia Zheng ([EMAIL PROTECTED]),
>>>>>>> Derek Gladding ([EMAIL PROTECTED]),
>>>>>>> Micah Villmow ([EMAIL PROTECTED]),
>>>>>>> June 8th, 2008)
>>>>>>> - it looks SIMPLE. Maybe there are a lot of details/source lines that
>>>>>>> were omitted from this example?
>>>>>>>
>>>>>>>> p.s. if you ask me, honestly, 250 watt or so for the latest GPU is
>>>>>>>> really too much.
>>>>>>>
>>>>>>> 250 W is the TDP; the declared average value is about 160 W. I don't
>>>>>>> remember which GPU - AMD's or Nvidia's - has a lot of special
>>>>>>> functional units for sin/cos/exp/etc. If they are not used, maybe the
>>>>>>> power will be a bit lower.
>>>>>>>
>>>>>>> As for the Firestream 9250, AMD quotes about 150 W (although I'm not
>>>>>>> absolutely sure that it's TDP) - the same as some Intel quad-core Xeon
>>>>>>> chips with names beginning with X.
>>>>>>>
>>>>>>> Mikhail
>>>>>>>
>>>>>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>>>>>
>>>>>>>>> BTW, why are GPGPUs considered vector systems?
>>>>>>>>> Taking into account that GPGPUs contain many (equal) execution units,
>>>>>>>>> I think it might be not the SIMD but the SPMD model. Or does it
>>>>>>>>> depend on the software tools used (CUDA etc.)?
>>>>>>>>>
>>>>>>>>> Mikhail Kuzminsky
>>>>>>>>> Computer Assistance to Chemical Research Center
>>>>>>>>> Zelinsky Institute of Organic Chemistry
>>>>>>>>> Moscow
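On Mikhail's question above about the Brook+ vector-add example looking simple: the kernel itself really is that small; what slide-sized examples usually leave out is the host-side allocation and copying. A minimal sketch of the equivalent in CUDA (CUDA rather than Brook+, purely for illustration; error handling omitted):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* The kernel really is one line of arithmetic per element. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Host buffers. */
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Device buffers and transfers - the part the tiny examples skip. */
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);   /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}

Even this toy needs a couple of dozen lines of setup around a three-line kernel; once tiling, shared memory, and multiple kernels enter the picture the line count grows quickly, which is presumably where sgemm's thousand-plus lines of .cu come from.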
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf