Re: [Beowulf] any gp-gpu clusters?

Vincent Diepeveen Sat, 23 Jun 2007 09:14:00 -0700

Hello Mark,

Well, I've spent the past few weeks investigating cards, and what it seems
is that so far the marketing department is far ahead of actual performance.

For the 8800, the fastest FFT that I could find claims 100 gflop out of a very expensive card that on paper should deliver nearly half a teraflop.

That is quite disappointing.
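
To put that gap in numbers, here is a rough sketch of the arithmetic. The 100 gflop figure is the FFT claim mentioned above; the 500 gflop peak is an assumption that reads "nearly half a teraflop" as an upper bound, so the real fraction could be slightly higher.

```python
# Achieved vs. paper performance for the 8800 FFT case.
claimed_fft_gflops = 100.0   # fastest FFT claim quoted in the message
paper_peak_gflops = 500.0    # ASSUMPTION: upper bound for "nearly half a teraflop"


def utilization(achieved_gflops, peak_gflops):
    """Return the fraction of the paper peak that is actually delivered."""
    return achieved_gflops / peak_gflops


fraction = utilization(claimed_fft_gflops, paper_peak_gflops)
print(f"The FFT reaches about {fraction:.0%} of the paper peak")
```
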

And we haven't even investigated that FFT yet; it seems to do something that most of us don't need at all, and what we do need is far more complicated to get working really well on those cards.

We also haven't even discussed how to do big matrix calculations, given the complexity of implementing them on this architecture.

You mention a thought that many have already had, namely that if you build a cluster, you can quite easily upgrade the cards in each node within a year or two.

Though this sounds interesting, right now a single card isn't delivering more than what a quad-core can deliver, whereas the quad-core can do much more and can use more RAM.

When the POWER6 system was presented in Amsterdam a week ago (40 Tflop in 2008; right now it's POWER5 and 14 Tflop), one scientist, as I still remember, was very happy with the 64 GB of RAM that each node has, as RAM speeds up his calculations more than additional processing power does.

So he for sure won't line up to do calculations on video cards with limited RAM.

If you plan to put a card or 4 into a single node, please realize that a single quad-core node eats about 172 watts (when using neither a video card nor I/O) or 180 watts when using a video card, this with all 4 cores at full usage.

Compare that with a single video card, which has a TDP of well over 200 watts at full usage.

If you plan to put in a video card or 4 at 225 watts each, you get a monster of an energy bill in return.
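
A minimal sketch of that power budget, using the 172 watt node and 225 watt card figures above. It assumes the node runs at full load around the clock and ignores power-supply losses and cooling, so a real bill would be higher; the function names are only illustrative.

```python
# Rough power budget for a quad-core node with n video cards.
NODE_WATTS = 172.0          # quad-core node at full load, no video card (from the message)
CARD_WATTS = 225.0          # per-card figure used in the message
HOURS_PER_YEAR = 24 * 365   # ASSUMPTION: the node runs flat out all year


def node_power_watts(num_cards):
    """Total draw of one node with num_cards cards, all at full usage."""
    return NODE_WATTS + num_cards * CARD_WATTS


def yearly_kwh(watts):
    """Energy used in a year at a constant draw, in kilowatt-hours."""
    return watts * HOURS_PER_YEAR / 1000.0


for cards in (0, 1, 4):
    watts = node_power_watts(cards)
    print(f"{cards} card(s): {watts:.0f} W, about {yearly_kwh(watts):.0f} kWh per year")
```

With 4 cards the node draws more than 6 times what the bare quad-core node does, which is where the energy bill comes from.
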

The easiest programming language (CUDA) also seems to deliver the least performance, compared with ATI's 2900 card.

The advantages of using a bunch of video cards in a single node are basically the following:

a) The speculation that the next generation of video cards from ATI and NVIDIA will deliver great performance for those who can use the card.

b) The theoretical possibility of saving on network costs, as the network is basically a PCI-e 16x slot on the mainboard.

So where one card is perhaps nearly equal to a quad-core, just on paper, for something that needs very little RAM, it is obvious that if you put in 4 of those cards you still need just 1 network card in the node to connect to the network.

c) On paper it would be possible for nodes equipped with 2 video cards, 1 simple card to address the system and 1 card to do calculations on, to be used by 2 users at the same time. One person could, on paper, use the video card and the other one the rest of the node. This is wishful thinking as of now, however. Which university is going to put in a monster that eats 200 watts or so at full performance and that just 1 or 2 users can use?


There are, however, a few weaknesses that remain:

a) you need n+1 cards in a system to use n cards for calculations

b) The measured latency between RAM and the card's RAM (not theoretical, but practical latency measured here) is far worse than what network cards deliver: roughly 50 us for the 8800 versus roughly 1.5 us one-way ping-pong latency for network cards.

The bandwidth is not better either, and with several cards per node it will probably deteriorate.

c) The limited amount of RAM on the card and the huge price of cards that do have more than half a gigabyte of DDR3; NVIDIA's high-clocked cards really are quite expensive.

d) The huge mass production that ATI and NVIDIA must achieve in order to sell those cards at a somewhat affordable price, instead of thousands per card, works against us. For just graphics, all they need is single precision floating point, whereas the number of people (that's the people on this Beowulf list) who want a card that is programmable like a CPU and use it for DSP-type workloads is quite limited. They need to produce and sell tens of millions of those cards, so selling a couple of thousand for calculation-type workloads is not really interesting to ATI/NVIDIA, and it is rather wishful thinking that cards will get really optimized for what we need.

e) It is very hard to get information about the cards, such as how the caches work; it's not even clear how BIG the caches on a card are or what the bottlenecks on the cards are. So programming for those cards in the manner that HPC needs, namely getting the utmost performance out of them, is totally impossible to do with some generic programming language. It requires complete full-time dedication, having friends at NVIDIA or ATI to get more info, and so on. It is very specialized work, in short.

This is currently by far the biggest obstacle to starting to program for those cards.

f) The few attempts that have been made so far had very disappointing results, for whatever reason. The lack of information basically means that the huge marketing balloons of ATI and NVIDIA, promising nearly half a teraflop per card now, are not even close to reality. Every project so far has failed to deliver more performance than existing generic code already delivers on a C2Q.
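
The latency figures in weakness b) can be put in perspective with a small sketch. It uses only the two measured values quoted there; the "serialized transfers per second" bound is a simplification that assumes each small transfer must finish before the next one starts, with no overlap or batching.

```python
# Host-to-card latency vs. network latency, using the figures from point b).
GPU_LATENCY_US = 50.0   # RAM <-> 8800 card RAM, roughly, as measured in the message
NIC_LATENCY_US = 1.5    # one-way ping-pong latency of network cards, roughly


def latency_ratio(slow_us, fast_us):
    """How many times slower the first latency is than the second."""
    return slow_us / fast_us


def max_serialized_transfers_per_second(latency_us):
    """Upper bound on back-to-back, latency-bound transfers per second."""
    return 1_000_000.0 / latency_us


ratio = latency_ratio(GPU_LATENCY_US, NIC_LATENCY_US)
print(f"Card latency is about {ratio:.0f}x the network latency")
for name, latency in (("card", GPU_LATENCY_US), ("network", NIC_LATENCY_US)):
    bound = max_serialized_transfers_per_second(latency)
    print(f"{name}: at most {bound:.0f} serialized small transfers per second")
```

So code that ping-pongs small messages between host and card pays far more per exchange than it would over the cluster network.
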

That said, on paper there is a theoretical possibility that such cards will in the future (perhaps by the end of 2007) get huge teraflop capabilities in single precision, which CPUs won't have any time soon, so keeping an eye on them is very interesting. As of now, graphics cards are simply our only hope of getting great gflop capabilities for a small price.

Giving up that dream is something not many of us will want to do.

Yet so far it is a mystery how to beat a 3 GHz Core 2 dual Xeon node with 16 cores and a big L2/L3 using a graphics card that has such tiny caches and is lobotomized everywhere, so that the total number of instructions it can process on paper simply can never be reached in practice.

To stay objective: ATI's latest 2900 card has 64 streaming processors, which ATI markets as 320 by the way, lying by a straight factor of 5, and it is clocked at just 742 MHz. So you start at a disadvantage against Core 2 of a factor 2.4 GHz / 0.742 GHz = 3.2.

So you must win a factor of 3.2 somewhere just to *keep the same speed* for your code.
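
The two factors above, worked out as a quick sketch. All four input numbers come from the message itself; comparing raw clock speeds ignores how much work each chip does per cycle, which is exactly the factor that would have to be won back.

```python
# Clock and marketing factors for the ATI 2900 vs. a 2.4 GHz Core 2.
CPU_CLOCK_GHZ = 2.4     # Core 2 clock used in the message
GPU_CLOCK_GHZ = 0.742   # ATI 2900 clock (742 MHz)
MARKETED_UNITS = 320    # stream processors as marketed
ACTUAL_UNITS = 64       # streaming processors as counted in the message

clock_factor = CPU_CLOCK_GHZ / GPU_CLOCK_GHZ
marketing_factor = MARKETED_UNITS / ACTUAL_UNITS

print(f"Clock disadvantage to make up: about {clock_factor:.1f}x")
print(f"Marketed vs. counted processors: {marketing_factor:.0f}x")
```
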

And this while on 22 July the 2.4 GHz quad-core drops to 266 dollars, whereas the ATI 2900 is currently priced at nearly 400 euros here.

It is very hard to compete when you already must make up for a factor of 3+ to start with. That 4.7 GHz POWER6 is far more interesting in that sense, yet I know in advance that I won't get any system time on it, whereas I CAN buy a video card for a couple of hundred euros.

The future will therefore tell whether future graphics chips can kick butt for a small price; I sure hope so.


Thanks,
Vincent

----- Original Message -----
From: "Mark Hahn" <[EMAIL PROTECTED]>
To: "Beowulf Mailing List" <Beowulf@beowulf.org>
Sent: Thursday, June 21, 2007 4:57 PM
Subject: [Beowulf] any gp-gpu clusters?

Hi all,
is anyone messing with GPU-oriented clusters yet?
I'm working on a pilot which I hope will be something like 8x workstations, each with 2x recent-gen gpu cards.
the goal would be to host cuda/rapidmind/ctm-type gp-gpu development.
part of the motive here is just to create a gpu-friendly infrastructure into which commodity cards can be added and refreshed every 8-12 months.
as opposed to "investing" in quadro-level cards which are too expensive enough to toss when obsoleted.
nvidia's 1U tesla (with two g80 chips) looks potentially attractive,
though I'm guessing it'll be premium/quadro-priced - not really in keeping with the hyper-moore's-law mantra...
if anyone has experience with clustered gp-gpu stuff, I'm interested in comments on particular tools, experiences, configuration of the host machines and networks, etc. for instance, is it naive to think that gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't necessarily need a hefty (IB, 10Geth) network?

thanks, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

