On Nov 20, 2008, at 5:39 PM, Jan Heichler wrote:

Hello Mark,

Thursday, 20 November 2008, you wrote:

>> [shameless plug]
>>
>> A project I have spent some time with is showing 117x on a 3-GPU machine over
>> a single core of a host machine (3.0 GHz Opteron 2222). The code is
>> mpihmmer, and the GPU version of it. See http://www.mpihmmer.org for more
>> details. Ping me offline if you need more info.
>>
>> [/shameless plug]

MH> I'm happy for you, but to me, you're stacking the deck by comparing to a
MH> quite old CPU. You could break out the prices directly, but comparing 3x
MH> GPU (modern? sounds like pci-express at least) to a current entry-level
MH> cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) would be more appropriate.



Instead of benchmarking some CPU vs. some GPU, wouldn't it be fairer to

a) compare systems of similar cost (1k, 2k, 3k EUR/USD)
b) compare systems with a similar power footprint

?



What good is it that 3 GPUs are 1000x faster than an Asus Eee PC?




Exactly.

http://re.jrc.ec.europa.eu/energyefficiency/html/standby_initiative_data%20centers.htm

The correct comparison is power usage, as that is what is 'hot' these days. Comparing plain cash alone is not enough. Weird yet true. In third-world nations like China and India, power is not a concern at all, not even for government-related tasks.
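Just to make the point concrete, here is a minimal sketch of what such a normalized comparison looks like. All the numbers below (speedups, prices, wattages) are placeholders I made up for illustration, not measurements of any real system:

#include <stdio.h>

/* Hypothetical systems: the speedup, price, and power figures are
   placeholders for illustration only, not measurements. */
struct box {
    const char *name;
    double speedup;    /* speedup relative to a single reference core */
    double price_eur;  /* purchase price in EUR                       */
    double watts;      /* power draw under load in W                  */
};

int main(void) {
    struct box systems[] = {
        { "3x GPU workstation (hypothetical)", 117.0, 3000.0, 900.0 },
        { "8-core cluster node (hypothetical)",   8.0, 2500.0, 350.0 },
    };
    for (int i = 0; i < 2; i++) {
        const struct box *b = &systems[i];
        printf("%-36s %7.1f x/kEUR %7.1f x/kW\n",
               b->name,
               b->speedup / (b->price_eur / 1000.0),
               b->speedup / (b->watts / 1000.0));
    }
    return 0;
}

Normalizing by price and by watts like this is what makes two very different boxes comparable at all.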

The slow adoption of manycores, even for workloads that (at least in theory) would do well on them, is definitely limited by portability.

I had some ESA dude on the phone a few days ago. I heard the word "portability" a bit too often. That's why they do so much in ugly, slow Java code. Not fast enough on one PC?
Put another 100 next to it.

I was given exactly the same reasoning (the portability problem) for other projects where I tried to sneak in GPU computing (regardless of manufacturer). Portability was the KILLER there too.

If you write bureaucratic paper documents, then of course CUDA is not portable and never will be, as the hardware is simply different from a CPU.

Yet that code must be portable between old Sun/UNIX-type machines and modern quadcores, as well as new GPU hardware in case you want to introduce GPUs. Not realistic, of course.

Just enjoy the speedup if you can get it, I'd say.

They can spend millions on hardware, but not even a couple of hundred thousand on customized software that would solve the portability problem with a plugin that does the crunching just for GPUs.

Idiotic, yet that's the plain truth.

So to speak, manycores will only make it in there once NASA writes a big article online bragging about how fast their supercomputing codes run on today's GPUs, of which they own 100k for number crunching.

I would argue that for workloads favourable to GPUs, which are just a very few as of now, NVIDIA/AMD is up to 10x faster than a quadcore, if you know how to get it out of the card.

So for now, GPGPU is probably the cheap alternative for a few very specific tasks in third-world nations.

May they lead us on the path ahead...

It is in itself very funny that a bureaucratic reason (portability) is the biggest problem limiting progress.

When you speak to hardware designers about, say, 32-core CPUs, they laugh out loud. The only scalable hardware for now that packs a big punch in a single CPU seems to be manycores.

All those managers have simply put their minds in a big storage bunker where alternatives are not allowed in. Even an economic crisis will not help. They have to get bombarded with actual products that are interesting to them and that get a huge speedup on GPUs before they start to understand the advantage of it.

The few who do understand already keep their stuff secret, and usually the guys who are not exactly very good at parallelization are the ones who get to "try out" the GPU in question. That's another recipe for disaster, of course.

Logically, they never even get a speedup over a simple quadcore. If you compare assembler-level SSE2 (modified Intel SSE2 primitives, if you like) with a clumsy guy (not in his own estimation) who tries out the GPU for a few weeks, obviously it is going to fail.
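For reference, "assembler-level SSE2" means code roughly along these lines, written with Intel's intrinsics. This is just a minimal sketch of a hand-vectorized sum (two double lanes per instruction), made up for illustration and not taken from any of the codes mentioned here:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sum an array of doubles two lanes at a time with SSE2.
   Assumes n is a multiple of 2 to keep the sketch short. */
double sse2_sum(const double *x, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(x + i));

    /* Horizontal reduction of the two partial sums. */
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];
}

Against a baseline written at that level, a naive few-week GPU port has little chance.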

Something that has been algorithmically optimized for PC-type hardware for some 20-30 years suddenly must get ported to a GPU within a few weeks. There are not many who can do that.

You need a completely different algorithmic approach for that. Something that is memory bound CAN get rewritten to be compute bound, sometimes even without losing speed. Just because they didn't have the luxury of such huge crunching power, they never tried!
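A toy illustration of that kind of rewrite, made up for this mail and not taken from any real code: replace a big precomputed table, whose scattered accesses are limited by memory, with recomputing the value on the fly, which is limited by arithmetic instead, and arithmetic is exactly what GPUs have in abundance.

#include <math.h>
#include <stddef.h>

#define TABLE_SIZE (1 << 22)            /* 4M doubles: far larger than any cache */
static double table[TABLE_SIZE];        /* precomputed values, filled elsewhere */

/* Memory-bound version: every scattered index is a likely cache miss,
   so throughput is limited by memory latency and bandwidth. */
double lookup_version(const unsigned *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += table[idx[i] & (TABLE_SIZE - 1)];
    return sum;
}

/* Compute-bound version: recompute the value instead of fetching it,
   trading memory traffic for arithmetic. */
double recompute_version(const unsigned *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = (double)(idx[i] & (TABLE_SIZE - 1)) / TABLE_SIZE;
        sum += sin(x) * exp(-x);        /* whatever the table used to store */
    }
    return sum;
}

Whether the second version wins on a CPU depends on how expensive the recomputation is; on a GPU the extra arithmetic is usually the cheaper side of the trade.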

But those 20 years of optimization are a big barrier for GPUs.

Add to that that Intel is used to GIVING AWAY hardware to developers.
I have yet to see NVIDIA do that.

If those same guys who failed had that hardware at home for years, they MIGHT get some ideas and tell their boss.

Right now it's the reports from those guys that add to the storage-bunker thinking.

It is wrong to assume that experts can predict the future.

Vincent


_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
