Re: [Beowulf] Nehalem and Shanghai code performance for our rzf example

Vincent Diepeveen Mon, 19 Jan 2009 17:40:16 -0800

hi Bill

I'm not limited by knowledge on materials unlike you.

I'd argue that if something gives 10x the clock rate it destroys everything, of course, even at 1/10th of the transistor capacity.


Current status is:

Phenom2 overclocks better and is really cheap, and when programmed really low level, near assembler level, it has a higher IPC than Nehalem. Especially in SSE2-type code it's dominant. It's just that the compiler fools you; it's Intel friendly, to put it politely. That seems to be the current status.

Yet objectively, Q6600 was a quantum leap forward. A brilliant design when it was released.

Connected L2's or not connected, who cares when it delivers a big punch?

Test-set tricks like hyperthreading: seen this, done that. It doesn't work for most HPC-type workloads. It just makes timing your software more complicated.

All the CPUs are still 4 cores; that's reality. I don't see progress in multicores.

Newer process technology, from 65 nm to 45 nm, hopefully produces CPUs more cheaply, yet it hardly clocks a lot higher at production level. Only for water-cooled overclockers does it make AMD suddenly very attractive now, yet that's not how clusters usually get built (my cluster is probably a big exception anyway; it has 1 node currently, to give one example).

Nehalem is hardly better performing IPC-wise than Q6600 for integer workloads, and it does so at a huge power cost. Phenom2 in fact is 0% better integer-wise than Phenom1. Even more disappointing in that respect. Just its price is cool: a factor 4 cheaper than Nehalem 3.2 GHz, nearly a factor 5, and just a 200 MHz lower default clock.


I'm quite disappointed by the new CPUs from Intel and AMD, to be honest.

The way these manufacturers 'fix' performance on paper is by using special test suites. Most test sets are too L3 oriented and too much subject to optimization by compiler teams.

If you have a $100 billion company and under 100 'test programs', that's what you get; for such a huge company with such a huge compiler team it's too easy to bust everything.

Current new-generation CPUs are faster on paper; in reality they aren't.

     "paper supports everything"

A $100 billion company will bust every test and manage to steer new tests toward a data size that benefits big L3's, whereas in reality big L3's are just not needed for HPC.

That's just total baloney for matrix calculations, CFD, whatever. Either your code hardly gets inside L3, or you need so many gigabytes of RAM that L3 doesn't matter either. A few MB is enough.

4 MB versus 16 MB is simply no big deal.

Only some 'chosen' working set sizes benefit from L3.
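To make the working-set argument concrete, here is a minimal sketch. It is a capacity-only model, and the function name `l3_upgrade_helps` is made up for illustration; it is not a measurement of any real CPU. It only says that a bigger L3 can matter when the working set spills out of the small L3 but still fits in the big one.

```python
# Hypothetical helper (not from any library): does going from a small L3
# to a big L3 matter for a given working set? Sizes are in bytes.
# Capacity-only model: it ignores associativity, prefetching and sharing
# of the cache between cores.
MB = 1024 * 1024


def l3_upgrade_helps(working_set, small_l3=4 * MB, big_l3=16 * MB):
    """Return True only when the working set is too big for the small L3
    but still fits inside the big L3."""
    return small_l3 < working_set <= big_l3


if __name__ == "__main__":
    # A small kernel, an in-between case, and a multi-gigabyte job.
    for size in (2 * MB, 8 * MB, 4096 * MB):
        print(size // MB, "MB working set, bigger L3 helps:",
              l3_upgrade_helps(size))
```

Under this simple model only the in-between sizes gain anything, which is the 'chosen' working set effect described above.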

A 20 GHz Phenom-GaAs will of course destroy everything.

As explained however, that doesn't really matter, because L3's eat relatively little power compared to the execution logic, so that is a big bummer in that case.

My plans for a 128-core multiprocessor (each core low power), which allows easy porting of HPC codes to it, as I voted for say 50% of the total RAM assigned local to each core through a local L2 (totally not shared with the other cores) and a very slow, possibly even off-chip, L3 cache to a shared memory (the other 50% of the RAM), got laughed away by some Intel fanboys. If that's the case then Intel is dead in HPC of course, as Nvidia and AMD will take over with GPU-type supercomputers. I tend to have more faith in engineers, though, than the fanboys do, and more than most professors have. I believe in new solutions, not in the vicious circles of the past.

A manycore is really complicated to write efficient algorithms for, whereas some modified multicore-type CPU is easy to port codes to.

I'd argue that approaching things from the software viewpoint, WHAT IS EASY TO PORT, might be a rather good idea for future CPU design.

If you now quote something that can run at 10x the clock speed, then the question is of course: "suppose we would make a big building filled with GaAs processing units, at what price can you build it for me, and what computing power does it give at what power?"

If the answer is: "the building might explode with odds 1 in a million", I'm sure some governments want to take that risk if it is that much faster. See it as a feature. Ideal feature to sell to N*SA, I'd argue.


The amount of power it uses is quite important IMHO.

Power should be an ever bigger concern in high-end HPC, I feel. Right now it is paper demands from governments that just receive lied statements; I feel this is unsellable to governments in future. The number of watts per gflop matters quite a lot.
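A minimal sketch of the watts-per-gflop figure, assuming you have a measured wall-socket power and a sustained (not peak) gflop rate. The function name and the numbers in the example are hypothetical, chosen only to show the arithmetic.

```python
def watts_per_gflop(power_watts, sustained_gflops):
    """Power cost of one sustained gflop. Lower is better.

    Both inputs are assumed to be measured on the same run:
    power at the wall, gflops from the real application.
    """
    if sustained_gflops <= 0:
        raise ValueError("sustained_gflops must be positive")
    return power_watts / sustained_gflops


if __name__ == "__main__":
    # Hypothetical node: 100 W at the wall, 50 gflops sustained.
    print(watts_per_gflop(100.0, 50.0))  # 2.0 watts per gflop
```

Comparing two machines on this number, measured on your own code rather than on a vendor test set, is the kind of check that paper specifications do not give you.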

If it were that easy to produce energy, we would of course already have cars running on electricity or on water.

Of course we want ECC and ECC RAM on every design. Too many errors at such computing power are not acceptable.


Best Regards,
Vincent



On Jan 19, 2009, at 7:31 PM, Bill Broadley wrote:

John Hearns wrote:
BTW, re the discussion on processor frequency scaling,
what finally did happen to Emitter Coupled Logic and gallium arsenide?
I followed the exponential "intel killer" for quite some time, although it seemed obvious to me from the first slides it was going to be a failure. Sky high clock rates, tiny caches, and a poor memory bus seemed to be destined
for failure.
If gallium arsenide or some other material gave us 10x the clock rate per watt, but 1/2 the transistors, would it really matter? Seemed like even intel is begrudgingly admitting it's the memory bus, and finally the nehalem is
blessed with dramatically more bandwidth.
Seems like increasingly cores are turning latency limited workloads (for the parallel jobs of course) into bandwidth limited ones. Without a memory bus that allows for 10x the bandwidth it doesn't really seem like 10x the clock
rate would be of particular use.
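
The bandwidth point above can be sketched with a simple roofline-style bound: attainable performance is the smaller of the compute peak and memory bandwidth times flops per byte. This is only an illustration; the function names and all numbers are assumptions, not measurements of Nehalem or any GaAs part.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: limited by compute or by the memory bus."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def speedup_from_faster_clock(peak_gflops, bandwidth_gb_s, flops_per_byte,
                              clock_factor=10.0):
    """Speedup when only the clock (compute peak) scales by clock_factor
    and the memory bandwidth stays the same."""
    before = attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte)
    after = attainable_gflops(peak_gflops * clock_factor, bandwidth_gb_s,
                              flops_per_byte)
    return after / before


if __name__ == "__main__":
    # Hypothetical machine: 40 gflops peak, 20 GB/s memory bandwidth.
    # Bandwidth-limited kernel (0.25 flops per byte): no gain at all.
    print(speedup_from_faster_clock(40.0, 20.0, 0.25))
    # Compute-heavy kernel (8 flops per byte): gains until the bus limits it.
    print(speedup_from_faster_clock(40.0, 20.0, 8.0))
```

In this model a 10x clock gives nothing to a kernel that already sits on the bandwidth limit, and less than 10x to a compute-heavy one once it hits the same bus.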

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

