Jim,

I think you touch on a very important point here.

Namely that a GPU is mainly interesting to program for hobbyists like me, or for companies whose budget covers fewer than a dozen
of them in total.

For ISPs the only thing that matters is power consumption, and for encryption at a low TCP/IP layer it is too easy to equip all those cheap CPUs with encryption coprocessors. Those draw something like 1 watt and deliver enough work to keep the 100 Mbit / 1 Gbit NICs fully busy; for public key they in fact run at a speed you won't reach on a GPU even if you manage to parallelize it and get it working in a great manner. ISPs of course look for fully scalable machines, quite the opposite of one card at 250 watt.

In fact it would be quite interesting to know how fast you can run RSA on a GPU. Where are the benchmarks? I seem to remember I once posted a solution for doing a fast generic modulo (of course not a new idea, but that is something you always hear after figuring it out yourself), with a minimum of code, under the condition that you already have multiplication code.
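
To give an idea of what I mean, here is a word-sized sketch of one such approach, a Barrett-style reduction: once you precompute a reciprocal it needs only multiplies, shifts and a conditional subtraction. This is just an illustration, not necessarily what I posted back then; the real thing operates on 4096-bit limb arrays using your big-number multiply.

#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;   /* gcc/clang extension: 64x64 -> 128 bit */

typedef struct {
    uint64_t n;   /* modulus, any 64-bit value > 1 */
    uint64_t m;   /* floor(2^64 / n), precomputed once */
} barrett_t;

static barrett_t barrett_init(uint64_t n)
{
    barrett_t b = { n, (uint64_t)(((u128)1 << 64) / n) };
    return b;
}

/* reduce any 64-bit x modulo n using two multiplies and a shift */
static uint64_t barrett_reduce(uint64_t x, barrett_t b)
{
    uint64_t q = (uint64_t)(((u128)x * b.m) >> 64);  /* estimate of x / n */
    uint64_t r = x - q * b.n;                        /* 0 <= r < 2n       */
    if (r >= b.n)
        r -= b.n;
    return r;
}

int main(void)
{
    barrett_t b = barrett_init(2147483647u);  /* a 31-bit prime */
    printf("%llu\n", (unsigned long long)barrett_reduce(12345678901234567ull, b));
    return 0;
}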

How fast can you multiply big numbers on those GPUs?
4096 x 4096 bits is the most interesting case there. Then of course take the modulo quickly and repeat this for the entire exponentiation by squaring.
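
The outer loop is just square-and-multiply; a word-sized sketch for illustration (swap in the 4096-bit multiply and reduction and this is essentially the whole modular-exponentiation workload of RSA):

#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;   /* gcc/clang extension */

/* right-to-left square-and-multiply, word-sized for illustration only;
   for RSA-4096 every multiply below becomes a 4096 x 4096 bit product
   followed by a reduction mod n */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t n)
{
    u128 result = 1, b = base % n;
    while (exp) {
        if (exp & 1)
            result = (result * b) % n;   /* multiply step */
        b = (b * b) % n;                 /* squaring step */
        exp >>= 1;
    }
    return (uint64_t)result;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)modexp(5, 117, 1000000007ull));
    return 0;
}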

That is the only interesting question IMHO: what throughput does it deliver for RSA-4096? I seem to remember a big handicap in such a case is that the older cards (8800 etc.) can only do 16 x 16 bits == 32 bits, whereas on CPUs you can use 64 x 64 bits == 128 bits. BIG difference in speed.
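
To make that difference concrete: with schoolbook multiplication a 4096-bit operand is 256 limbs of 16 bits versus 64 limbs of 64 bits, so one product costs 256 x 256 = 65,536 hardware multiplies instead of 64 x 64 = 4,096, a factor of 16. A sketch of the 16-bit-limb version, assuming r is zeroed and has room for 2*limbs limbs:

#include <stdint.h>

/* schoolbook multiply with 16-bit limbs and 32-bit partial products,
   i.e. the widest native multiply on an 8800-class GPU */
void mul_schoolbook16(const uint16_t *a, const uint16_t *b,
                      uint16_t *r, int limbs)
{
    for (int i = 0; i < limbs; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < limbs; j++) {      /* limbs^2 multiplies */
            uint32_t t = (uint32_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint16_t)t;
            carry = t >> 16;
        }
        r[i + limbs] = (uint16_t)carry;
    }
}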

Yet those hobbyists, who are the people actually interested in GPU programming, have limited time to get software running and a budget far smaller than $10k. They're not even going to buy as many Teslas as NASA will.
Not a dozen.

The state GPU programming is in now is that some big companies can have a person toy full-time with one GPU, as of course the idea of having a CPU with hundreds of cores is very attractive and looks like a realistic future,
so companies must explore that future.

Of course every GPU/CPU company is serious in its aim to produce products that perform well; none of us doubts that.

Yet it is only attractive to hobbyists, and those hobbyists are not going to get from Nvidia the in-depth technical data needed to get the maximum out of the GPUs. This is a big problem. Those hobbyists have very limited time to get their spare-time number-crunching projects done, so being busy full-time writing test programs to learn everything about one specific GPU is not something they all like to do for a hobby. Just having that information available would attract the hobbyists, as they are willing to take the risk of buying one Tesla and spending time there. That produces software. That software will have a certain performance,
and based upon that performance perhaps some companies might get interested.

Intel and AMD will be doing better there, I hope.

Note that testing with CUDA is also suboptimal: a kernel launch only gets to run for about 5 seconds maximum when the card also drives the display, so you need a machine with a second video card. That requires a mainboard with at least two PCI-e 16x slots. And how do you cluster that? My cluster cards are PCI-X, not PCI-e: Quadrics QM400s.
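
The alternative to a second card is chopping the work up so that no single kernel launch ever approaches the watchdog limit. A minimal sketch (the kernel body here is just a stand-in for real work):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void crunch_chunk(float *data, int n, int pass)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 1.000001f + pass;   /* stand-in for real work */
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    for (int pass = 0; pass < 1000; ++pass) {       /* many short launches */
        crunch_chunk<<<grid, block>>>(d_data, n, pass);
        cudaDeviceSynchronize();    /* each launch finishes well under the watchdog */
    }

    cudaFree(d_data);
    return 0;
}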

I can get boards at 139 euro with one PCI-e 16x slot and build quad-core Q6600 nodes at 500 euro, as soon as I have a job again.
My MacBook Pro 17'' has no free PCI-e 16x slot though.

So for number crunching, a cluster always beats a single Nvidia video card. Communication with the video cards over PCI-e has too much latency to
parallelize software that is not embarrassingly parallel.

The majority of hobbyists will have a similar problem with Nvidia, which is very sad in itself.

A good CUDA setup that can beat a simplistic cluster is not as cheap and easy to program for as building that cluster is. The memory also scales better in those clusters than it does on the cards. Even if one node can do less work than one GPU, it is still easier to get that exponential speedup by having a shared cache across
all nodes (this is true for a lot of modern crunching algorithms).
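
What I mean by a shared cache: for example a transposition table (or any lookup cache) partitioned over the nodes by hashing the key, so the combined RAM of the whole cluster acts as one big cache. A hypothetical sketch of just the routing, names made up:

#include <stdint.h>

typedef struct {
    int      num_nodes;         /* nodes in the cluster        */
    uint64_t entries_per_node;  /* cache slots each node holds */
} dist_cache_t;

/* which node owns the entry for this key */
static int cache_owner(const dist_cache_t *c, uint64_t key)
{
    return (int)(key % (uint64_t)c->num_nodes);
}

/* which slot in that node's local table */
static uint64_t cache_slot(const dist_cache_t *c, uint64_t key)
{
    return (key / (uint64_t)c->num_nodes) % c->entries_per_node;
}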

With a GPU you're forced to do all calculation, including caching, within the GPU and within the limited device RAM. Now, contrary to what most people tend to believe, there are usually methods to get away with a limited amount of RAM using modern overwrite-on-collision caching schemes; even when that loses a factor of 2, there are ways to overcome it. The biggest limitation is that communication with other nodes is really hard.

Scaling to more nodes is just not going to happen, of course, as the latency between the nodes is really bad and every message takes extra slow hops: first from device RAM to host RAM, then from host RAM to the network card, from the card into the remote node's RAM, and
from that RAM into its device RAM.
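
In code a single exchange between GPUs on two different nodes looks roughly like this (a sketch assuming plain CUDA plus MPI, staging everything through host RAM; function and buffer names are made up):

#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

/* swap n floats of device data with a peer rank, via host RAM */
void exchange_gpu_buffer(float *d_buf, int n, int peer, MPI_Comm comm)
{
    std::vector<float> h_send(n), h_recv(n);

    /* hop 1: device RAM -> host RAM over PCI-e */
    cudaMemcpy(h_send.data(), d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    /* hop 2: host RAM -> peer's host RAM over the interconnect */
    MPI_Sendrecv(h_send.data(), n, MPI_FLOAT, peer, 0,
                 h_recv.data(), n, MPI_FLOAT, peer, 0,
                 comm, MPI_STATUS_IGNORE);

    /* hop 3: the peer did the same, so copy what arrived back to device RAM */
    cudaMemcpy(d_buf, h_recv.data(), n * sizeof(float), cudaMemcpyHostToDevice);
}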

Let's make a list of the problems that most clusters here compute on, and you'll see how much the GPU concept still needs
to mature before it works well for most codes.

Software that needs low-latency interconnects you could therefore only get to work within one card, provided the RAM access is not bottlenecked. Yet all reports so far indicate that it is. The lack of information there is just not very encouraging, and for professional crunching work, which companies do have a big budget for, building or buying your own low-power coprocessor that so to speak even fits into an iPod is just too easy.

So in the end I guess some stupid extension of SSE will give a bigger increase in crunching power than the in itself attractive GPGPU hardware,
the biggest limitation being hobbyists' development time.

Vincent

On Jun 17, 2008, at 4:01 PM, Jim Lux wrote:

Quoting Linus Harling <[EMAIL PROTECTED]>, on Mon 16 Jun 2008 04:31:56 PM PDT:

Vincent Diepeveen skrev:
<snip>

Then instead of a $200 pci-e card, we needed to buy expensive Tesla's
for that, without getting
very relevant indepth technical information on how to program for that
type of hardware.

The few trying on those Tesla's, though they won't ever post this as
their job is fulltime GPU programming,
report so far very disappointing numbers for applications that really
matter for our nations.
</snip>

Tomography is kind of important to a lot of people:

http://tech.slashdot.org/tech/08/05/31/1633214.shtml
http://www.dvhardware.net/article27538.html
http://fastra.ua.ac.be/en/index.html

But of course, that was done with regular $500 cards, not Teslas.


Mind you, if you go and get a tomographic scan today, they already use fast hardware to do it. Only researchers on limited budgets tolerate taking days to reduce the data on a desktop PC. And, while the concept of doing faster processing with a <10KEuro box is attractive in that environment, I suspect it's a long way from being commercially viable in that role.

The current tomographic technology (e.g. GE Lightspeed) is pretty impressive. They slide you in, and 10-15 seconds later, there's 3 d rendered models and slices on the screen. The equipment is pretty hassle free, the UI straightforward from what I could see, etc.

And, of course, people are willing (currently) to pay many millions for a machine to do this. I suspect that the other costs of running a CT scanner (both capital and operating) overwhelm the cost of the computing power, so going from a $100K box to a $20K box is a drop in the bucket. When you're talking MRI, for instance, there's the cost of the liquid helium for the magnets.

That's a long way from a bunch of grad students racking up a bunch of PCs.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
