Richard,

The big question we should raise, to answer your HPC-market question, is whether hand-coded x64 SSE2 assembler also counts. Nothing is as weird as all those bizarre SSE2 instructions. There is no way to make an objective analysis of why some weirdo
instructions are in and why some very useful stuff is not.

Let's zoom in on one detail.

What we really need in SSE2, to speed up some FFT-type transforms big time and to make it attractive for all kinds of small-sized codes, is a lowbitsmultiply instruction. Represent a register as 2 x 64-bit integers, A:B, and multiply it with C:D.

Then we want to be able to execute the following 2 instructions, especially on Intel processors:

lowbitsmultiply  :  A:B * C:D ===> AC (mod 2^64) : BD (mod 2^64)
highbitsmultiply :  A:B * C:D ===> AC >> 64      : BD >> 64
where '>>' means shift right by 64

Basically the Itanium could have such instructions in floating point,
so why not the PC processors too, for 64 bits?

Of course we need these 2 instructions to be a tad faster than integer multiplication currently is in the integer units. We want a throughput of 1 per cycle, without blocking other execution units. When vectorizing the SSE units further, it is not nice when a single instruction in one of the execution units blocks all the others.

There is no truly perfect hardware from an HPC viewpoint. Fiddling with SSE2 instructions to get a job done in such a cryptic manner is something very few programmers are good at, as algorithms and caches haven't become any simpler since the 80s.

So I would argue that the number-crunching market is already dominated by SSE2 acceleration and will see even more of that type of thing, which basically means a return, in some sense, to the 80s for programmers who want to get the maximum out of a CPU.

It gets interesting when Intel comes out with something and AMD with Yet Unother Crunching Koncept (YUCK),
which takes care that more cash gets pocketed by programmers.

So that is good news for the jobless low-level programmers (me, says the fool) :)

Note that if GPUs get equipped with this type of instruction, even at just 32 bits, that would already be a big step, as it allows the CRT (Chinese Remainder Theorem) to get the job done,
which is how it currently gets solved on the PC.

The point is that register size, and which types of instructions your hardware supports, matter. It especially matters to disclose which ones you support. Video cards simply have the luck that our coordinate systems can easily be expressed in 32-bit floating point.

When needed, even 16 bits.

So I would not wait for a GPU that is entirely 64-bit double precision and/or 64-bit integer,
as that would slow it down in its biggest market.

Note that on paper it would be possible right now to make a chess program run really fast on a GPU, if the RAM accesses were latency-optimized (so 2 random accesses from all stream processors, each 10,000 cycles to the RAM; the first access being a read, the second one a write). All the other theoretical problems I have solved on paper.

Yet this small 'problem' dominates so much that I am not going to gamble on programming it just to find out
whether my assumption is correct :)

However, to reveal some of the math comparing GPUs versus CPUs, if I were to design a new 'braindead'
chess program:

Let's say we have 240 32-bit cores at 0.675 GHz.

Let's now assume we need 10k cycles for each node at each stream processor (1 nps = 1 node per second = 1 chess position searched per second, hunting for the holy grail). I assume on average 1 instruction per cycle (a dangerous assumption, considering there are also 2 RAM hits).

675000k / 10k   = 67.5k nps
240 * 67.5k nps = 16.2 million nps

Now compare that with what I fight against in tournaments:
a Skulltrail mainboard with 2 Xeon CPUs overclocked to 4 GHz,
so 8 cores in total, with lots of RAM.

Looking at the practical nps that fast chess programs get on it right now,
that is about 20 million nps.

PC faster than GPU.

I will skip all kinds of technical discussion, such as where the PC loses something: its memory controller is nowhere near fast enough to serve every node, so it loses big time in the last plies, at least 20-30%. We also assumed the same parallel speedup for 8 cores as for 240 cores, while game-tree search is one of the hardest challenges to parallelize, so you will effectively lose a lot more at 240 cores than at 8. In fact I would guess 30% efficiency for the GPU versus 87.5% for the PC. Yet there are things to discover there that make for an interesting challenge, so I skip that algorithmic discussion entirely here,
for now.

In short, for problems that in the past were latency-oriented, the dominating factor is:
"how many instructions per second can you execute?"

The PC simply beats the GPU here, and this is for a typical 32-bit problem (in the case of my chess software). The PC can effectively execute more instructions per cycle, especially when also counting the instructions
it can skip by taking branches.

Note that the PC also wins on power consumption, at 4 GHz across 2 sockets.

It has taken me many months of redoing the above math, trying to find a way to somehow get the GPU faster than the PC.
I haven't managed so far.

The biggest difference between a CPU and a GPU when doing such math is the low clock of the GPU versus the high clock of the CPU. At events the CPU already wins nearly a factor of 6 on clock speed alone.

I showed up at a world championship in 2003 with a supercomputer with 500 MHz CPUs and fought against opponents with 2.8 GHz MP Xeon CPUs. Also nearly a factor of 6 difference in clock speed.

That is a really big dent in your self-confidence.

Maybe it is good that Greg Lindahl didn't join that event; he would have googled a tad and shown up with that
one-liner of Seymour Cray's.

Vincent

On Jun 17, 2008, at 9:00 PM, [EMAIL PROTECTED] wrote:


-------------- Original message --------------
From: Jim Lux <[EMAIL PROTECTED]>

> Well.. to be fair, there were (and still are) businesses out there
> (particularly a few years ago) that didn't fully understand the
> concept of needing net profit. (ah yes, the glory days of startups
> "buying market share" in the dot-com bubble) And, some folks made a
> fine living in the mean time. (But, then, those folks weren't the
> owners, were they, or if they were, in a limited sense, they now have
> some decorative wallpaper..)
>

Hey Jim,

Gold rushes are good (and greed too, I guess ... ;-) ...) ... there IS often gold in them there hills, it is just that very few, if anyone, know exactly where. So, the less risk-averse among us, and those with more money than sense (thankfully, I say), started digging. Most of their trials end in error, but the rest of us benefit from the few that are lucky/smart enough to find it. I think you are assuming that the future is far more predictable than it in fact is, even for the best and brightest like yourself ... what percentage of the HPC market will accelerators have at this time next year?

Regards,

rbw

--

"Making predictions is hard, especially about the future."

Niels Bohr

--

Richard Walsh
Thrashing River Consulting--
5605 Alameda St.
Shoreview, MN 55126

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

