On Feb 28, 2008, at 4:33 PM, Mark Hahn wrote:
The problem with many (cores|threads) is the memory bandwidth wall.
A fixed-size (B) pipe to memory, with N requesters on that pipe ...
What wall? Bandwidth is easy; it just costs money, and not much at
that. Want 50GB/sec[1]? Buy a $170 video card. Want 100GB/sec... buy a
Heh... if it were that easy, we would spend extra on more bandwidth
for Harpertown and Barcelona ...
I think the point is that chip vendors are not talking about a mere
doubling of the number of cores, but (apparently with straight faces)
about things like 1k GP cores/chip.
Personally, I think they're in for a surprise: there isn't a vast
market for more than 2-4 cores per chip.
Microsoft might give them a helping hand there by making their own
software more user friendly, thereby requiring somewhat heavier
processors :)
limits, and no programming technique is going to get you around that
limit per socket. You need to change your programming technique to go
many-socket. That limit is the bandwidth wall.
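
(Just to put rough numbers on that fixed pipe; the 10 GB/s socket
figure below is an assumption, purely illustrative, not a benchmark:)

/* Illustrative only: a fixed socket bandwidth B shared by N cores.
   The 10 GB/s figure is an assumed value, not a measurement. */
#include <stdio.h>

int main(void)
{
    const double socket_bw_gbs = 10.0;  /* assumed fixed pipe B, GB/s */

    for (int cores = 1; cores <= 1024; cores *= 2)
        printf("%4d cores -> %.3f GB/s per core\n",
               cores, socket_bw_gbs / cores);
    return 0;
}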
IMO, this is the main fallacy behind the current industry harangue.
The problem is _NOT_ that programmers are dragging their feet, but
rather some combination of Amdahl's law and the low average _inherent_
parallelism of computation. (I'm _not_ talking about MC or graphics
rendering here, but about today's most common computer uses: web and
email.)
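
(For concreteness, Amdahl's law in a few lines of C; the parallel
fractions are assumed values, chosen only to show how quickly the
speedup saturates:)

/* Amdahl's law: speedup(N) = 1 / ((1 - p) + p/N) for a program whose
   parallelizable fraction is p, run on N cores. The example fractions
   are assumptions for illustration. */
#include <stdio.h>

static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double fracs[] = { 0.5, 0.9, 0.99 };

    for (int i = 0; i < 3; i++)
        printf("p=%.2f: 16 cores -> %5.1fx, 1024 cores -> %5.1fx\n",
               fracs[i], amdahl(fracs[i], 16), amdahl(fracs[i], 1024));
    return 0;
}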
The manycore cart is being put before the horse. Worse, no one has
really shown that manycore (and the presumed ccNUMA model) is actually
scalable to large core counts on "normal" workloads. (Getting good
scaling for an AM CFD code on 128 cores in an Altix is a rather
different proposition than scaling to 128 cores in a single chip.)
As far as I know, all current examples of large ccNUMA scaling are
premised on core:memory ratios of about 4:1 (4 Itanium 2 cores per
bank of DRAM in an Altix, for instance). I don't doubt that we can
improve memory bandwidth (and concurrency) per chip, but it's not an
area-driven process, so it will never keep up.

So: do an exponential and a sublinear trend diverge? Yes: meet the
memory wall.
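
(A toy model of that divergence; both growth rates below are made-up
assumptions, there only to show the trend, not a roadmap:)

/* Toy model: cores doubling per generation vs. bandwidth growing more
   slowly (1.3x per generation). Both rates and starting points are
   assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    double cores = 4.0, bw_gbs = 10.0;

    for (int gen = 0; gen <= 8; gen++) {
        printf("gen %d: %5.0f cores, %7.1f GB/s, %.3f GB/s per core\n",
               gen, cores, bw_gbs, bw_gbs / cores);
        cores  *= 2.0;   /* assumed core growth per generation      */
        bw_gbs *= 1.3;   /* assumed bandwidth growth per generation */
    }
    return 0;
}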
What's missing is a reason to think that basically all workloads can
be made cache-friendly enough to scale to 10^2 or 10^3 cores. I just
don't see that.
Some fields, which are over-represented on this mailing list, just
require more RAM rather than CPU. Only a few fields have
embarrassingly parallel software that needs only CPU power and not
much RAM; most of those are encryption- or security-related searches.
There is, however, a growing number of fields where communication
speed between the processors is very important: not so much the
bandwidth, but rather the latency. CPUs are now so fast that
algorithms can kick in whose branching factor (in practice, the time
needed to advance the process by one step or iteration) depends
heavily on communication speed between the processors, and especially
on reusing data stored in the (huge) RAM of other memory nodes.
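
(A crude sketch of that latency dependence; every constant below is
an assumption, meant only to show how remote fetches come to dominate
the time per step:)

/* Crude model: time per iteration = local compute + number of remote
   fetches * remote-memory latency. All constants are assumptions. */
#include <stdio.h>

int main(void)
{
    const double compute_us = 1.0;        /* assumed local work/step   */
    const double remote_latency_us = 2.0; /* assumed cost of one fetch */

    for (int fetches = 0; fetches <= 8; fetches += 2)
        printf("%d remote fetches -> %.1f us per iteration\n",
               fetches, compute_us + fetches * remote_latency_us);
    return 0;
}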
In the long run, of course, many fields will converge to such types
of algorithms; field after field is inventing algorithms like that,
which is a logical consequence of the progress in hardware. Today's
high-end hardware ALLOWS complex algorithms to be invented, which
IMHO is a good thing.

Now let's invent something that makes me coffee :)
It's really a memory-to-core issue: from what I see, the goal should
be something in the range of 1GB per core. There are examples up to
10G/core and down to 100M/core, but not really beyond that. (Except
for stream processing, which is great stuff but _cries_ for
non-general-purpose HW.)

1GB per core is the standard that most supercomputers had eight or so
years ago. It's quite interesting to see how RAM and latency between
CPUs haven't kept pace with CPU crunching power.
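
(For what it's worth, a Linux box will tell you where it sits on that
GB-per-core scale with a couple of sysconf() calls; _SC_PHYS_PAGES and
_SC_NPROCESSORS_ONLN are glibc extensions, so this is Linux-ish, not
strictly portable:)

/* Report installed RAM per online core (Linux/glibc sysconf names). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pages     = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    long cores     = sysconf(_SC_NPROCESSORS_ONLN);

    if (pages < 0 || page_size < 0 || cores <= 0) {
        perror("sysconf");
        return 1;
    }
    double total_gb = (double)pages * page_size
                      / (1024.0 * 1024.0 * 1024.0);
    printf("%.1f GB total, %ld cores, %.2f GB/core\n",
           total_gb, cores, total_gb / cores);
    return 0;
}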
As data rates get higher, even really good bit error rates on the
wire get to be too big. Consider this: a BER of 1E-10 is quite good,
but if you're pumping 10Gb/s over the wire, that's an error every
second. (A BER of 1E-10 is a typical rate for something like a
100Mbps link...) So, practical systems ...
I'm no expert, but 1e-10 seems quite high to me. The docs I found
about 10G requirements all specified 1e-12, and claimed to have
achieved 1e-15 in realistic, long-range tests...
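
(The arithmetic is just line rate times BER; a quick check, with the
rates and BERs picked to match the figures mentioned above:)

/* Mean error rate = line rate (bits/s) * BER. The rate/BER pairs are
   the ones discussed above, not measured values. */
#include <stdio.h>

int main(void)
{
    const double rates_bps[] = { 100e6, 10e9,  10e9  };
    const double bers[]      = { 1e-10, 1e-10, 1e-12 };

    for (int i = 0; i < 3; i++) {
        double errs_per_sec = rates_bps[i] * bers[i];
        printf("%5.0f Mb/s at BER %.0e -> %.3g errors/s "
               "(one every %.0f s)\n",
               rates_bps[i] / 1e6, bers[i], errs_per_sec,
               1.0 / errs_per_sec);
    }
    return 0;
}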
In the end the conclusion will of course be that we very badly need a
newer process technology from ASML to get into production, so we can
produce CPUs with even more transistors and push some of the problems
into the 1GB L3 cache :)
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf