Hello Mark,

Well, I've spent the past few weeks investigating cards, and what it seems
is that so far the marketing department is far ahead of actual performance.

On this 8800 card the fastest FFT that I could find claims 100 gflop, out of a very expensive card
that on paper should deliver nearly half a teraflop.

That is quite disappointing.
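
To put that number in proportion, here is a minimal sketch of the ratio between the claimed FFT rate and the paper rate. The 500 gflop peak is an assumption standing in for "nearly half a teraflop"; the actual paper figure is a bit lower, so the real fraction would come out slightly higher.

```python
# Claimed FFT rate from the text, in gflop.
CLAIMED_FFT_GFLOP = 100.0
# Assumption: "nearly half a teraflop" rounded up to 500 gflop.
ASSUMED_PEAK_GFLOP = 500.0


def fraction_of_peak(achieved_gflop: float, peak_gflop: float) -> float:
    """Return the achieved rate as a fraction of the paper peak."""
    return achieved_gflop / peak_gflop


frac = fraction_of_peak(CLAIMED_FFT_GFLOP, ASSUMED_PEAK_GFLOP)
print(f"FFT reaches about {frac:.0%} of the assumed paper peak")
```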

And we haven't even investigated that FFT yet; it seems to do something that most of us don't need at all. What we all need is far more complicated to get working really well on those cards.

We also haven't even discussed how to do big matrix calculations, given the complexity of implementing them on this
architecture.

You mention a thought that many have had already, namely that if you build a cluster, within a year or 2 you can
quite easily upgrade the cards in each node.

Though this sounds interesting, right now a single card isn't delivering more than what a quadcore can deliver,
whereas the quadcore can do much more and can use more RAM.

When the power6 system was presented in Amsterdam a week ago (40 Tflop in 2008; right now it's power5 and 14 Tflop), I can still remember how one scientist was very happy with the 64GB of RAM that each node has, as RAM speeds up his calculations more than additional processing power does.

So he for sure won't line up to calculate on videocards with limited RAM.

If you plan to put a card or 4 into a single node, please realize that a single quadcore node eats about 172 watt (when using neither a videocard nor i/o) or 180 watt when using a videocard, this with all 4 cores at full usage.

This while a single videocard has a TDP of far over 200 watt at full usage.

If you plan to put in a videocard or 4 @ 225 watt each, you get a monster of an energy bill in return.
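
A rough sketch of that power budget, using the 172 watt node figure and the 225 watt per card from above. That the numbers simply add up, and that everything runs flat out all year, are simplifying assumptions; no electricity price is filled in since none is given here.

```python
# Figures from the text: a fully loaded quadcore node without a videocard,
# and the per-card draw used in the "card or 4 @ 225 watt" estimate.
NODE_WATT = 172
CARD_WATT = 225

HOURS_PER_YEAR = 24 * 365


def node_power_watt(n_cards: int) -> int:
    """Total draw of one node, assuming node and card power simply add."""
    return NODE_WATT + n_cards * CARD_WATT


def yearly_kwh(watt: float) -> float:
    """Energy per year at constant full load (an assumption)."""
    return watt * HOURS_PER_YEAR / 1000.0


for n in (0, 1, 4):
    w = node_power_watt(n)
    print(f"{n} card(s): {w} watt, {yearly_kwh(w):.0f} kWh per year")
```

With 4 cards the cards alone draw more than 5 times what the quadcore node does under this model.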

The easiest programming language (CUDA) also seems to deliver the least performance,
compared with ATI's 2900 card.

The advantages of using a bunch of videocards in a single node are basically the following:

a) the speculation that the next generation videocards from ATI and NVIDIA will deliver great performance for those
who can use the card

b) the theoretical possibility to save on network costs, as the network is basically a pci-e 16x slot on the mainboard.

So where one card is perhaps nearly equal to a quadcore, just on paper, for something that needs very little RAM, it is obvious that if you put in 4 of those cards you still need just 1 network card in the node to connect to the network.

c) on paper it would be possible that nodes equipped with 2 videocards, 1 simple card to address the system and 1 card to do calculations on, can be used by 2 users at the same time. On paper, one person could use the videocard and the other the rest of the node. This is however wishful thinking as of now. Which university is going to put in a monster that eats 200 watt or so at full performance and that just 1 or 2 users can use?

There are however a few weaknesses that remain:

a) you need n+1 cards in a system to use n cards for calculations

b) The measured latency, so not theoretical but practical latency measured here, between RAM and the card's RAM is far worse than what network cards deliver: roughly 50 us for the 8800 versus roughly 1.5 us one-way ping-pong latency for network cards.

The bandwidth is not better either, and with several cards per node it will probably deteriorate.

c) the limited amount of RAM on-card and the huge price of cards that do have more than half a gigabyte of DDR3;
nvidia's high-clocked cards really are quite expensive.

d) the huge mass production that ATI and NVIDIA must achieve to keep the price of those cards somewhat affordable, instead of thousands per card, works against us. For graphics alone all they need is single precision floating point, whereas the group of people (that's the people on this beowulf list) who want a card that is programmable like a cpu and want to use it for DSP-type workloads is quite limited. They need to produce and sell tens of millions of those cards, so selling a couple of thousand for calculation-type workloads is not really interesting to ati/nvidia, and it is rather wishful thinking that cards will get really optimized for what we need.

e) it is very hard to get information about the cards, such as how the caches work; it's not even clear how BIG the caches on a card are or what the bottlenecks on the cards are. So programming those cards the way HPC needs, namely getting the utmost performance out of them, is totally impossible with some generic programming language. It requires complete fulltime dedication, friends at nvidia or ati to get more info, and so on. In short, it is very specialized work.

This is currently by far the biggest obstacle to start programming for those cards.

f) the few attempts made so far had very disappointing results, for whatever reason. The lack of information basically means that the huge marketing balloons of ATI and NVIDIA, promising nearly half a teraflop per card now, are not even close to reality. Every project so far has failed to deliver more performance than existing generic code already delivers on c2q.
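
The latency figures from weakness b) can be put side by side in a small sketch. It uses only the two rough measurements quoted there (50 us and 1.5 us); counting back-to-back small transfers per second is just an illustration, and it ignores bandwidth entirely.

```python
# Rough one-way latencies from the text, in microseconds.
CARD_LATENCY_US = 50.0     # host RAM <-> 8800 card RAM
NETWORK_LATENCY_US = 1.5   # network card, ping-pong


def latency_ratio(slow_us: float, fast_us: float) -> float:
    """How many times slower the slow path is."""
    return slow_us / fast_us


def small_transfers_per_second(latency_us: float) -> float:
    """Upper bound on strictly sequential tiny transfers (bandwidth ignored)."""
    return 1_000_000.0 / latency_us


ratio = latency_ratio(CARD_LATENCY_US, NETWORK_LATENCY_US)
print(f"card path is roughly {ratio:.0f}x slower per small transfer")
print(f"card:    {small_transfers_per_second(CARD_LATENCY_US):.0f} transfers/s")
print(f"network: {small_transfers_per_second(NETWORK_LATENCY_US):.0f} transfers/s")
```

So any code that often exchanges small pieces of data with the card pays a far higher price per exchange than it would over the network.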

That said, on paper there is a theoretical possibility that such cards will in future (perhaps end of 2007) get huge teraflop capabilities in single precision, which cpu's won't have any time soon, so keeping an eye on them is very interesting. As of now, graphics cards are simply our only hope of getting great gflop capabilities for a small price.

Not many of us will want to give up that dream.

Yet so far it is a mystery how to beat a 3Ghz core2 @ 16 cores dual Xeon node with a big L2/L3 using a graphics card that has such tiny caches and is lobotomized everywhere, so that the total number of instructions it can process on paper simply can never be reached.

To stay objective: ATI's latest 2900 card has 64 streaming processors, which ATI markets as 320 by the way, lying by a factor of 5 straight away, and it is clocked at just 742Mhz. So you start at a disadvantage against core2 of a factor: 2.4Ghz / 0.742Ghz = 3.2

So you must win a factor 3.2 somewhere just to *keep the same speed* for your code.
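
The two factors above can be checked with a few lines. This is only a sketch of the arithmetic with the numbers quoted here; it compares raw clocks and says nothing about how much work each chip does per clock.

```python
# Clock speeds from the text, in Ghz.
CORE2_GHZ = 2.4
ATI2900_GHZ = 0.742

# Streaming processor counts from the text.
MARKETED_PROCESSORS = 320
ACTUAL_PROCESSORS = 64

# Factor the card has to win back just to match the cpu clock.
clock_factor = CORE2_GHZ / ATI2900_GHZ

# Factor between the marketed and the actual processor count.
marketing_factor = MARKETED_PROCESSORS / ACTUAL_PROCESSORS

print(f"clock disadvantage: {clock_factor:.1f}x")
print(f"marketing factor:   {marketing_factor:.0f}x")
```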

This while on 22 July the 2.4Ghz quadcore drops to 266 dollar, whereas the ati2900 is currently priced at nearly 400 EURO here.

It is very hard to compete when you must first make up a factor 3+ to start with. That 4.7Ghz power6 is far more interesting in that sense, yet I know in advance I won't get any system time on it,
whereas I CAN buy a videocard for a couple of hundred euros.

The future will therefore tell whether graphics chips can kick butt for a small price; I sure hope so.

Thanks,
Vincent

----- Original Message -----
From: "Mark Hahn" <[EMAIL PROTECTED]>
To: "Beowulf Mailing List" <Beowulf@beowulf.org>
Sent: Thursday, June 21, 2007 4:57 PM
Subject: [Beowulf] any gp-gpu clusters?


Hi all,
is anyone messing with GPU-oriented clusters yet?

I'm working on a pilot which I hope will be something like 8x workstations, each with 2x recent-gen gpu cards.
the goal would be to host cuda/rapidmind/ctm-type gp-gpu development.

part of the motive here is just to create a gpu-friendly infrastructure into which commodity cards can be added and refreshed every 8-12 months, as opposed to "investing" in quadro-level cards which are too expensive to toss when obsoleted.

nvidia's 1U tesla (with two g80 chips) looks potentially attractive,
though I'm guessing it'll be premium/quadro-priced - not really in keeping with the hyper-moore's-law mantra...

if anyone has experience with clustered gp-gpu stuff, I'm interested in comments on particular tools, experiences, configuration of the host machines and networks, etc. for instance, is it naive to think that gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't
necessarily need a hefty (IB, 10Geth) network?

thanks, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

