Richard,

The big question we should raise, to answer your HPC-market question, is whether hand-coded x64 SSE2 assembler also counts. Nothing is as weird as all those bizarre SSE2 instructions. There is no way to make an objective analysis of why some weirdo
instructions are in and why some very useful stuff is not.

Let's zoom in on one detail.

What we really need in SSE2, to speed up some FFT-type transforms big time and to make it attractive for all kinds of small-sized codes, is a lowbitsmultiply instruction. Represent a register as 2 x 64-bit integers, A:B, and multiply it with C:D.

Then we want to be able to execute the following 2 instructions, especially on Intel processors:

lowbitsmultiply  :  A:B * C:D ===> AC (mod 2^64) : BD (mod 2^64)
highbitsmultiply :  A:B * C:D ===> AC >> 64      : BD >> 64
where '>>' means shift right by 64

Basically the Itanium could have such instructions in floating point,
so why not the PC processors too, for 64 bits?

Of course we need these 2 instructions to be a tad faster than integer multiplication currently is in the integer units. We want a throughput of 1 per cycle, without blocking other execution units. When vectorizing the SSE units further, it is not nice when a single instruction in one of the execution units blocks all the others.

There is no truly perfect hardware from an HPC viewpoint. Fiddling with SSE2 instructions to get a job done in such a cryptic manner is something very few programmers are good at, as algorithms and caches haven't become any simpler since the 80s.

So I would argue that the number-crunching market is already dominated by SSE2 acceleration and will see even more of that type of thing, which basically means a return, in some sense, to the 80s for programmers who want to get the maximum out of a CPU.

It gets interesting when Intel comes out with something and AMD with Yet Unother Crunching Koncept (YUCK),
which takes care that more cash gets pocketed by programmers.

So that is good news for the jobless low-level programmers (me, says the fool) :)

Note that if GPUs get equipped with this type of instruction, even at just 32 bits, that would already be a big step, as it allows the CRT (Chinese Remainder Theorem) to get the job done,
which is how it currently gets solved on the PC.

The point is that register size, and which types of instructions your hardware supports, matter. It especially matters to disclose which ones you support. Video cards simply have the luck that our coordinate systems can easily be expressed in 32-bit floating point.

When needed, even 16 bits.

So I would not wait for a GPU that is entirely 64-bit double precision and/or 64-bit integer,
as that would slow it down in its biggest market.

Note that on paper it would be possible right now to make a chess program run really fast on a GPU, if the RAM accesses were latency-optimized (so 2 random accesses from all stream processors, each 10,000 cycles to the RAM; the first access being a read, the second one a write). All the other theoretical problems I have solved on paper.

Yet this small 'problem' dominates so much that I am not going to gamble on programming it just to find out
whether my assumption is correct :)

However, to reveal some of the math comparing GPUs versus CPUs, if I were to design a new 'braindead'
chess program:

Let's say we have 240 32-bit cores at 0.675 GHz.

Let's now assume we need 10k cycles for each node at each stream processor (1 nps = 1 node per second = 1 chess position searched per second, hunting for the holy grail). I assume on average 1 instruction per cycle (a dangerous assumption, considering there are also 2 RAM hits).

675000k / 10k   = 67.5k nps
240 * 67.5k nps = 16.2 million nps

Now compare that with what I fight against in tournaments:
a Skulltrail mainboard with 2 Xeon CPUs overclocked to 4 GHz,
so 8 cores in total, with lots of RAM.

Looking at the practical nps that fast chess programs get on it right now,
that is about 20 million nps.

PC faster than GPU.

I will skip all kinds of technical discussion, such as where the PC loses something: its memory controller is nowhere near fast enough to serve every node, so it loses big time in the last plies, at least 20-30%. We also assumed the same parallel speedup for 8 cores as for 240 cores, while game-tree search is one of the hardest challenges to parallelize, so you will effectively lose a lot more at 240 cores than at 8. In fact I would guess 30% efficiency for the GPU versus 87.5% for the PC. Yet there are things to discover there that make for an interesting challenge, so I skip that algorithmic discussion entirely here,
for now.

In short, for problems that in the past were latency-oriented, the dominating factor is:
"how many instructions per second can you execute?"

The PC simply beats the GPU here, and this is for a typical 32-bit problem (in the case of my chess software). The PC can effectively execute more instructions per cycle, especially when also counting the instructions
it can skip by taking branches.

Note that the PC also wins on power consumption, at 4 GHz across 2 sockets.

It has taken me many months of redoing the above math, trying to find a way to somehow get the GPU faster than the PC.
I haven't managed so far.

The biggest difference between a CPU and a GPU when doing such math is the low clock of the GPU versus the high clock of the CPU. At events the CPU already wins nearly a factor of 6 on clock speed alone.

I showed up at a world championship in 2003 with a supercomputer with 500 MHz CPUs and fought against opponents with 2.8 GHz MP Xeon CPUs. Also nearly a factor of 6 difference in clock speed.

That is a really big dent in your self-confidence.

Maybe it is good that Greg Lindahl didn't join that event; he would have googled a tad and shown up with that
one-liner of Seymour Cray's.

Vincent

On Jun 17, 2008, at 9:00 PM, [EMAIL PROTECTED] wrote:


-------------- Original message --------------
From: Jim Lux <[EMAIL PROTECTED]>

> Well.. to be fair, there were (and still are) businesses out there
> (particularly a few years ago) that didn't fully understand the
> concept of needing net profit. (ah yes, the glory days of startups
> "buying market share" in the dot-com bubble) And, some folks made a
> fine living in the mean time. (But, then, those folks weren't the
> owners, were they, or if they were, in a limited sense, they now have
> some decorative wallpaper..)
>

Hey Jim,

Gold rushes are good (and greed too, I guess ... ;-) ...) ... there IS often gold in them there hills, it is just that very few, if anyone, know exactly where. So, the less risk-averse among us, and those with more money than sense (thankfully, I say), started digging. Most of their trials end in error, but the rest of us benefit from the few that are lucky/smart enough to find it. I think you are assuming that the future is far more predictable than it in fact is, even for the best and brightest like yourself ... what percentage of the HPC market will accelerators have at this time next year?

Regards,

rbw

--

"Making predictions is hard, especially about the future."

Niels Bohr

--

Richard Walsh
Thrashing River Consulting--
5605 Alameda St.
Shoreview, MN 55126

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

