Hi Vincent,

Yes, the libraries can't cover all calculations; they can do only some
calculations on the GPU. GPGPU is just a small step toward many-core
architectures. It offers great power, but with a deadly weakness. When a
complex calculation can be split into many pieces and each piece can work
independently, the calculation runs well on a GPU; otherwise the GPU fails
to deliver its power. Currently, the DP performance of GPUs is not as good
as we expected, only 1/8 to 1/10 of the SP FLOPS. That is also a problem.
I would suggest hybrid computation platforms, with GPUs, CPUs, and
processors like ClearSpeed. That may be a good topic for programming
models.

Regards,
Li, Bo

----- Original Message -----
From: "Vincent Diepeveen" <[EMAIL PROTECTED]>
To: "Li, Bo" <[EMAIL PROTECTED]>
Cc: "Mikhail Kuzminsky" <[EMAIL PROTECTED]>; "Beowulf" <beowulf@beowulf.org>
Sent: Thursday, August 28, 2008 12:22 AM
Subject: Re: [Beowulf] gpgpu

> Hi Bo,
>
> Thanks for your message.
>
> What library do I call to find primes?
>
> Currently it's searching here for primes (PRPs) of the form
> p = (2^n + 1) / 3
>
> n here is roughly 1.5 million bits as we speak.
>
> For SSE2-type processors there is the George Woltman assembler code
> (MiT) to do the squaring + implicit modulo;
> how do you plan to beat that type of really optimized number crunching
> on a GPU?
>
> You'll have to figure out a way to find an instruction-level
> parallelism of at least 32,
> which also doesn't write to the same cacheline, I *guess* (no
> documentation to verify that in fact).
>
> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes
>
> In fact the first problem to solve is to do some sort of squaring
> really quickly.
>
> If you figured that out on a PC, experience teaches that you're still
> losing a potential factor of 8,
> thanks to another zillion optimizations.
>
> You're not allowed to lose a factor of 8. The 52 gflops a GPU can
> deliver on paper @ 250 watt TDP (you bet it will consume that
> when you let it work so hard) means the GPU effectively delivers less
> than 7 gflops double precision thanks to inefficient code.
>
> Additionally, remember the P4. On paper the claim for integers when it
> was released was that it would be able to execute 4 integers a
> cycle; the reality is that it was a processor getting an IPC far under 1
> for most integer codes. All kinds of stuff sucked on it.
>
> Experience teaches that this is the same for today's GPUs: the
> scientists who have run codes on them so far, and who are really
> experienced CUDA programmers, figured out the speed they deliver is a
> very big bummer.
>
> Additionally, 250 watt TDP for massive number crunching is too much.
>
> It's well over a factor of 2 of the power consumption of a quadcore.
> Now I can soon take a look in China myself at what power prices
> are over there, but I can assure you they will rise soon.
>
> Now that's a lot less than a quadcore delivers with a TDP far under
> 100 watt.
>
> Now I explicitly mention the n's I'm searching here, as it should fit
> within caches.
> So the very secret bandwidth you can practically achieve (as we know,
> Nvidia lobotomized
> bandwidth in the GPU cards; only the Tesla type seems not to be
> lobotomized),
> I'm not even teasing you with that.
>
> This is true for any type of code. You're losing it to the details.
> Only custom-tailored solutions will work,
> simply because they're factors faster.
>
> Thanks,
> Vincent
>
> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>
>> Hello,
>> IMHO, it is better to call BLAS or a similar library rather
>> than programming your own functions. And CUDA treats the GPU as a
>> cluster, so .CU does not work like our normal codes. If you have
>> many matrix or vector computations, it is better to use Brook+/
>> CAL, which can show the great power of AMD GPUs.
>> Regards,
>> Li, Bo
>> ----- Original Message -----
>> From: "Mikhail Kuzminsky" <[EMAIL PROTECTED]>
>> To: "Vincent Diepeveen" <[EMAIL PROTECTED]>
>> Cc: "Beowulf" <beowulf@beowulf.org>
>> Sent: Wednesday, August 27, 2008 2:35 AM
>> Subject: Re: [Beowulf] gpgpu
>>
>>
>>> In message from Vincent Diepeveen <[EMAIL PROTECTED]> (Tue, 26 Aug 2008
>>> 00:30:30 +0200):
>>>> Hi Mikhail,
>>>>
>>>> I'd say they're ok for black box 32 bits calculations that can do
>>>> with a GB or 2 RAM,
>>>> other than that they're just luxurious electric heating.
>>>
>>> I also want to have a simple black box, but 64-bit (Tesla C1060 or
>>> Firestream 9170 or 9250). Unfortunately life isn't restricted to
>>> BLAS/LAPACK/FFT :-)
>>>
>>> So I'll need to program something else. People say that the best
>>> choice is CUDA for Nvidia. When I look at the sgemm source, it has
>>> about 1 thousand (or more) lines in *.cu files. Therefore I think
>>> that a somewhat more difficult algorithm, such as some special matrix
>>> diagonalization, will require a lot of programming work :-(.
>>>
>>> It's interesting that when I read the Firestream Brook+ "kernel
>>> function" source example - for addition of 2 vectors ("Building a
>>> High Level Language Compiler For GPGPU",
>>> Bixia Zheng ([EMAIL PROTECTED])
>>> Derek Gladding ([EMAIL PROTECTED])
>>> Micah Villmow ([EMAIL PROTECTED])
>>> June 8th, 2008)
>>>
>>> - it looks SIMPLE. Maybe there are a lot of details/source lines
>>> which were omitted from this example?
>>>
>>>
>>>> Vincent
>>>> p.s. if you ask me, honestly, 250 watt or so for the latest GPU is
>>>> really too much.
>>>
>>> 250 W is the TDP; the declared average value is about 160 W. I don't
>>> remember which GPU - from AMD or Nvidia - has a lot of special
>>> functional units for sin/cos/exp/etc. If they are not used, maybe
>>> the power will be a bit lower.
>>>
>>> As for the Firestream 9250, AMD says about 150 W (although I'm not
>>> absolutely sure that it's TDP) - that's the same as for some
>>> Intel Xeon quad-core chips with names beginning with X.
>>>
>>> Mikhail
>>>
>>>
>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>
>>>>> BTW, why are GPGPUs considered vector systems?
>>>>> Taking into account that GPGPUs contain many (equal) execution
>>>>> units, I think it might be not the SIMD but the SPMD model. Or
>>>>> does it depend on the software tools used (CUDA etc.)?
>>>>>
>>>>> Mikhail Kuzminsky
>>>>> Computer Assistance to Chemical Research Center
>>>>> Zelinsky Institute of Organic Chemistry
>>>>> Moscow
>>>>> _______________________________________________
>>>>> Beowulf mailing list, Beowulf@beowulf.org
>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
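
For readers who want to see what a PRP search over numbers of the form
p = (2^n + 1) / 3 involves, here is a minimal sketch in Python. It is not
the George Woltman assembler code mentioned in the thread, and it makes
no claim about how the actual search is implemented. It assumes a plain
Fermat probable-prime test with base 3 (the base is an assumption of this
sketch) and relies on Python's built-in three-argument pow() for the
repeated squaring with modular reduction. The function names wagstaff and
is_fermat_prp are hypothetical, chosen only for illustration.

```python
def wagstaff(n: int) -> int:
    """Return (2**n + 1) // 3 for odd n >= 3.

    For odd n, 2**n + 1 is divisible by 3, so the division is exact.
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be odd and at least 3")
    return ((1 << n) + 1) // 3


def is_fermat_prp(candidate: int, base: int = 3) -> bool:
    """Fermat probable-prime test: base**(candidate - 1) == 1 (mod candidate).

    A True result means "probable prime", not a proof of primality.
    The default base 3 is an assumption made for this sketch.
    """
    if candidate < 5:
        # Tiny values are handled directly; base 3 cannot test 3 itself.
        return candidate in (2, 3)
    # pow(b, e, m) performs square-and-multiply with a reduction at each step.
    return pow(base, candidate - 1, candidate) == 1


if __name__ == "__main__":
    # Small exponents only; the thread talks about n around 1.5 million bits.
    for n in range(3, 40, 2):
        if is_fermat_prp(wagstaff(n)):
            print(n, wagstaff(n))
```

Nearly all of the run time sits in the modular squarings inside pow(),
which is the "squaring + implicit modulo" step the thread identifies as
the part that has to be fast on any hardware.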