Jim,

I think you touch on a very important point here.

Namely that a GPU is mainly interesting to program for hobbyists like me, or for companies whose budget covers fewer than a dozen
of them in total.

For ISPs the only thing that matters is power consumption, and for encryption at a low TCP/IP layer it is too easy to equip all those cheap CPUs with encryption coprocessors. Those draw something like 1 watt and deliver enough work to keep the 100 Mbit / 1 Gbit NICs fully busy; for public key they in fact run at a speed you won't reach on a GPU even if you manage to parallelize it and get it working in a great manner. ISPs of course look for fully scalable machines, quite the opposite of one card at 250 watt.

In fact it would be quite interesting to know how fast you can run RSA on a GPU. Where are the benchmarks? I seem to remember I once posted a solution for doing a fast generic modulo (of course not a new idea, but that is something you always hear after figuring it out yourself), with a minimum of code, under the condition that you already have multiplication code.
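
To give an idea of what I mean, here is a word-sized sketch of one such approach, a Barrett-style reduction: once you precompute a reciprocal it needs only multiplies, shifts and a conditional subtraction. This is just an illustration, not necessarily what I posted back then; the real thing operates on 4096-bit limb arrays using your big-number multiply.

#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;   /* gcc/clang extension: 64x64 -> 128 bit */

typedef struct {
    uint64_t n;   /* modulus, any 64-bit value > 1 */
    uint64_t m;   /* floor(2^64 / n), precomputed once */
} barrett_t;

static barrett_t barrett_init(uint64_t n)
{
    barrett_t b = { n, (uint64_t)(((u128)1 << 64) / n) };
    return b;
}

/* reduce any 64-bit x modulo n using two multiplies and a shift */
static uint64_t barrett_reduce(uint64_t x, barrett_t b)
{
    uint64_t q = (uint64_t)(((u128)x * b.m) >> 64);  /* estimate of x / n */
    uint64_t r = x - q * b.n;                        /* 0 <= r < 2n       */
    if (r >= b.n)
        r -= b.n;
    return r;
}

int main(void)
{
    barrett_t b = barrett_init(2147483647u);  /* a 31-bit prime */
    printf("%llu\n", (unsigned long long)barrett_reduce(12345678901234567ull, b));
    return 0;
}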

How fast can you multiply big numbers on those GPUs?
4096 x 4096 bits is the most interesting case there. Then of course take the modulo quickly and repeat this for the entire exponentiation by squaring.
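
The outer loop is just square-and-multiply; a word-sized sketch for illustration (swap in the 4096-bit multiply and reduction and this is essentially the whole modular-exponentiation workload of RSA):

#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;   /* gcc/clang extension */

/* right-to-left square-and-multiply, word-sized for illustration only;
   for RSA-4096 every multiply below becomes a 4096 x 4096 bit product
   followed by a reduction mod n */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t n)
{
    u128 result = 1, b = base % n;
    while (exp) {
        if (exp & 1)
            result = (result * b) % n;   /* multiply step */
        b = (b * b) % n;                 /* squaring step */
        exp >>= 1;
    }
    return (uint64_t)result;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)modexp(5, 117, 1000000007ull));
    return 0;
}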

That is the only interesting question IMHO: what throughput does it deliver for RSA-4096? I seem to remember a big handicap in such a case is that the older cards (8800 etc.) can only do 16 x 16 bits == 32 bits, whereas on CPUs you can use 64 x 64 bits == 128 bits. BIG difference in speed.
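
To make that difference concrete: with schoolbook multiplication a 4096-bit operand is 256 limbs of 16 bits versus 64 limbs of 64 bits, so one product costs 256 x 256 = 65,536 hardware multiplies instead of 64 x 64 = 4,096, a factor of 16. A sketch of the 16-bit-limb version, assuming r is zeroed and has room for 2*limbs limbs:

#include <stdint.h>

/* schoolbook multiply with 16-bit limbs and 32-bit partial products,
   i.e. the widest native multiply on an 8800-class GPU */
void mul_schoolbook16(const uint16_t *a, const uint16_t *b,
                      uint16_t *r, int limbs)
{
    for (int i = 0; i < limbs; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < limbs; j++) {      /* limbs^2 multiplies */
            uint32_t t = (uint32_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint16_t)t;
            carry = t >> 16;
        }
        r[i + limbs] = (uint16_t)carry;
    }
}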

Yet those hobbyists, who are the people actually interested in GPU programming, have limited time to get software running and a budget far smaller than $10k. They're not even going to buy as many Teslas as NASA will.
Not a dozen.

The state GPU programming is in now is that some big companies can have a person toy full-time with one GPU, as of course the idea of having a CPU with hundreds of cores is very attractive and looks like a realistic future,
so companies must explore that future.

Of course every GPU/CPU company is serious in its aim to produce products that perform well; none of us doubts that.

Yet it is only attractive to hobbyists, and those hobbyists are not going to get from Nvidia the in-depth technical data needed to get the maximum out of the GPUs. This is a big problem. Those hobbyists have very limited time to get their spare-time number-crunching projects done, so being busy full-time writing test programs to learn everything about one specific GPU is not something they all like to do for a hobby. Just having that information available would attract the hobbyists, as they are willing to take the risk of buying one Tesla and spending time there. That produces software. That software will have a certain performance,
and based upon that performance perhaps some companies might get interested.

Intel and AMD will be doing better there, I hope.

Note that testing with CUDA is also suboptimal: a kernel launch only gets to run for about 5 seconds maximum when the card also drives the display, so you need a machine with a second video card. That requires a mainboard with at least two PCI-e 16x slots. And how do you cluster that? My cluster cards are PCI-X, not PCI-e: Quadrics QM400s.
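
The alternative to a second card is chopping the work up so that no single kernel launch ever approaches the watchdog limit. A minimal sketch (the kernel body here is just a stand-in for real work):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void crunch_chunk(float *data, int n, int pass)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 1.000001f + pass;   /* stand-in for real work */
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    for (int pass = 0; pass < 1000; ++pass) {       /* many short launches */
        crunch_chunk<<<grid, block>>>(d_data, n, pass);
        cudaDeviceSynchronize();    /* each launch finishes well under the watchdog */
    }

    cudaFree(d_data);
    return 0;
}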

I can get boards at 139 euro with one PCI-e 16x slot and build quad-core Q6600 nodes at 500 euro, as soon as I have a job again.
My MacBook Pro 17'' has no free PCI-e 16x slot though.

So for number crunching, a cluster always beats a single Nvidia video card. Communication with the video cards over PCI-e has too much latency to
parallelize software that is not embarrassingly parallel.

The majority of hobbyists will have a similar problem with Nvidia, which is very sad in itself.

A good CUDA setup that can beat a simplistic cluster is not as cheap and easy to program for as building that cluster is. The memory also scales better in those clusters than it does on the cards. Even if one node can do less work than one GPU, it is still easier to get that exponential speedup by having a shared cache across
all nodes (this is true for a lot of modern crunching algorithms).
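
What I mean by a shared cache: for example a transposition table (or any lookup cache) partitioned over the nodes by hashing the key, so the combined RAM of the whole cluster acts as one big cache. A hypothetical sketch of just the routing, names made up:

#include <stdint.h>

typedef struct {
    int      num_nodes;         /* nodes in the cluster        */
    uint64_t entries_per_node;  /* cache slots each node holds */
} dist_cache_t;

/* which node owns the entry for this key */
static int cache_owner(const dist_cache_t *c, uint64_t key)
{
    return (int)(key % (uint64_t)c->num_nodes);
}

/* which slot in that node's local table */
static uint64_t cache_slot(const dist_cache_t *c, uint64_t key)
{
    return (key / (uint64_t)c->num_nodes) % c->entries_per_node;
}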

With a GPU you're forced to do all calculation, including caching, within the GPU and within the limited device RAM. Now, contrary to what most people tend to believe, there are usually methods to get away with a limited amount of RAM using modern overwrite-on-collision caching schemes; even when that loses a factor of 2, there are ways to overcome it. The biggest limitation is that communication with other nodes is really hard.

Scaling to more nodes is just not going to happen, of course, as the latency between the nodes is really bad and every message takes extra slow hops: first from device RAM to host RAM, then from host RAM to the network card, from the card into the remote node's RAM, and
from that RAM into its device RAM.
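
In code a single exchange between GPUs on two different nodes looks roughly like this (a sketch assuming plain CUDA plus MPI, staging everything through host RAM; function and buffer names are made up):

#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

/* swap n floats of device data with a peer rank, via host RAM */
void exchange_gpu_buffer(float *d_buf, int n, int peer, MPI_Comm comm)
{
    std::vector<float> h_send(n), h_recv(n);

    /* hop 1: device RAM -> host RAM over PCI-e */
    cudaMemcpy(h_send.data(), d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    /* hop 2: host RAM -> peer's host RAM over the interconnect */
    MPI_Sendrecv(h_send.data(), n, MPI_FLOAT, peer, 0,
                 h_recv.data(), n, MPI_FLOAT, peer, 0,
                 comm, MPI_STATUS_IGNORE);

    /* hop 3: the peer did the same, so copy what arrived back to device RAM */
    cudaMemcpy(d_buf, h_recv.data(), n * sizeof(float), cudaMemcpyHostToDevice);
}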

Let's make a list of the problems that most clusters here compute on, and you'll see how much the GPU concept still needs
to mature before it works well for most codes.

Software that needs low-latency interconnects you could therefore only get to work within one card, provided the RAM access is not bottlenecked. Yet all reports so far indicate that it is. The lack of information there is just not very encouraging, and for professional crunching work, which companies do have a big budget for, building or buying your own low-power coprocessor that so to speak even fits into an iPod is just too easy.

So in the end I guess some stupid extension of SSE will give a bigger increase in crunching power than the in itself attractive GPGPU hardware,
the biggest limitation being hobbyists' development time.

Vincent

On Jun 17, 2008, at 4:01 PM, Jim Lux wrote:

Quoting Linus Harling <[EMAIL PROTECTED]>, on Mon 16 Jun 2008 04:31:56 PM PDT:

Vincent Diepeveen skrev:
<snip>

Then instead of a $200 pci-e card, we needed to buy expensive Tesla's
for that, without getting
very relevant indepth technical information on how to program for that
type of hardware.

The few trying on those Tesla's, though they won't ever post this as
their job is fulltime GPU programming,
report so far very disappointing numbers for applications that really
matter for our nations.
</snip>

Tomography is kind of important to a lot of people:

http://tech.slashdot.org/tech/08/05/31/1633214.shtml
http://www.dvhardware.net/article27538.html
http://fastra.ua.ac.be/en/index.html

But of course, that was done with regular $500 cards, not Teslas.


Mind you, if you go and get a tomographic scan today, they already use fast hardware to do it. Only researchers on limited budgets tolerate taking days to reduce the data on a desktop PC. And, while the concept of doing faster processing with a <10KEuro box is attractive in that environment, I suspect it's a long way from being commercially viable in that role.

The current tomographic technology (e.g. GE Lightspeed) is pretty impressive. They slide you in, and 10-15 seconds later, there's 3 d rendered models and slices on the screen. The equipment is pretty hassle free, the UI straightforward from what I could see, etc.

And, of course, people are willing (currently) to pay many millions for a machine to do this. I suspect that the other costs of running a CT scanner (both capital and operating) overwhelm the cost of the computing power, so going from a $100K box to a $20K box is a drop in the bucket. When you're talking MRI, for instance, there's the cost of the liquid helium for the magnets.

That's a long way from a bunch of grad students racking up a bunch of PCs.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
