On Feb 28, 2008, at 4:33 PM, Mark Hahn wrote:
The problem with many (cores|threads) is the memory bandwidth wall.
A fixed-size (B) pipe to memory, with N requesters on that pipe ...
What wall? Bandwidth is easy; it just costs money, and not much at
that. Want 50GB/sec[1]? Buy a $170 video card. Want 100GB/sec... buy a
Heh... if it were that easy, we would spend extra on more bandwidth
for Harpertown and Barcelona ...
I think the point is that chip vendors are not talking about a mere
doubling of the number of cores, but (apparently with straight faces)
about things like 1k GP cores/chip.
Personally, I think they're in for a surprise: there isn't a vast
market for more than 2-4 cores per chip.
Microsoft might give them a helping hand there by making their own
software more user friendly, thereby requiring somewhat heavier
processors :)
limits, and no programming technique is going to get you around that
limit per socket. You need to change your programming technique to go
many-socket. That limit is the bandwidth wall.
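
(Just to put rough numbers on that fixed pipe; the 10 GB/s socket
figure below is an assumption, purely illustrative, not a benchmark:)

/* Illustrative only: a fixed socket bandwidth B shared by N cores.
   The 10 GB/s figure is an assumed value, not a measurement. */
#include <stdio.h>

int main(void)
{
    const double socket_bw_gbs = 10.0;  /* assumed fixed pipe B, GB/s */

    for (int cores = 1; cores <= 1024; cores *= 2)
        printf("%4d cores -> %.3f GB/s per core\n",
               cores, socket_bw_gbs / cores);
    return 0;
}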
IMO, this is the main fallacy behind the current industry harangue.
The problem is _NOT_ that programmers are dragging their feet, but
rather some combination of Amdahl's law and the low average _inherent_
parallelism of computation. (I'm _not_ talking about MC or graphics
rendering here, but about today's most common computer uses: web and
email.)
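
(For concreteness, Amdahl's law in a few lines of C; the parallel
fractions are assumed values, chosen only to show how quickly the
speedup saturates:)

/* Amdahl's law: speedup(N) = 1 / ((1 - p) + p/N) for a program whose
   parallelizable fraction is p, run on N cores. The example fractions
   are assumptions for illustration. */
#include <stdio.h>

static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double fracs[] = { 0.5, 0.9, 0.99 };

    for (int i = 0; i < 3; i++)
        printf("p=%.2f: 16 cores -> %5.1fx, 1024 cores -> %5.1fx\n",
               fracs[i], amdahl(fracs[i], 16), amdahl(fracs[i], 1024));
    return 0;
}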
The manycore cart is being put before the horse. Worse, no one has
really shown that manycore (and the presumed ccNUMA model) is actually
scalable to large core counts on "normal" workloads. (Getting good
scaling for an AM CFD code on 128 cores in an Altix is a rather
different proposition than scaling to 128 cores in a single chip.)
As far as I know, all current examples of large ccNUMA scaling are
premised on core:memory ratios of about 4:1 (4 Itanium 2 cores per
bank of DRAM in an Altix, for instance). I don't doubt that we can
improve memory bandwidth (and concurrency) per chip, but it's not an
area-driven process, so it will never keep up.

So: do an exponential and a sublinear trend diverge? Yes: meet the
memory wall.
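
(A toy model of that divergence; both growth rates below are made-up
assumptions, there only to show the trend, not a roadmap:)

/* Toy model: cores doubling per generation vs. bandwidth growing more
   slowly (1.3x per generation). Both rates and starting points are
   assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    double cores = 4.0, bw_gbs = 10.0;

    for (int gen = 0; gen <= 8; gen++) {
        printf("gen %d: %5.0f cores, %7.1f GB/s, %.3f GB/s per core\n",
               gen, cores, bw_gbs, bw_gbs / cores);
        cores  *= 2.0;   /* assumed core growth per generation      */
        bw_gbs *= 1.3;   /* assumed bandwidth growth per generation */
    }
    return 0;
}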
What's missing is a reason to think that basically all workloads can
be made cache-friendly enough to scale to 10^2 or 10^3 cores. I just
don't see that.
Some fields, which are over-represented on this mailing list, just
require more RAM rather than CPU. Only a few fields have
embarrassingly parallel software that needs only CPU power and not
much RAM; most of those are encryption- or security-related searches.
There is, however, a growing number of fields where communication
speed between the processors is very important: not so much the
bandwidth, but rather the latency. CPUs are now so fast that
algorithms can kick in whose branching factor (in practice, the time
needed to advance the process by one step or iteration) depends
heavily on communication speed between the processors, and especially
on reusing data stored in the (huge) RAM of other memory nodes.
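
(A crude sketch of that latency dependence; every constant below is
an assumption, meant only to show how remote fetches come to dominate
the time per step:)

/* Crude model: time per iteration = local compute + number of remote
   fetches * remote-memory latency. All constants are assumptions. */
#include <stdio.h>

int main(void)
{
    const double compute_us = 1.0;        /* assumed local work/step   */
    const double remote_latency_us = 2.0; /* assumed cost of one fetch */

    for (int fetches = 0; fetches <= 8; fetches += 2)
        printf("%d remote fetches -> %.1f us per iteration\n",
               fetches, compute_us + fetches * remote_latency_us);
    return 0;
}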
In the long run, of course, many fields will converge to such types
of algorithms; field after field is inventing algorithms like that,
which is a logical consequence of the progress in hardware. Today's
high-end hardware ALLOWS complex algorithms to be invented, which
IMHO is a good thing.

Now let's invent something that makes me coffee :)
It's really a memory-to-core issue: from what I see, the goal should
be something in the range of 1GB per core. There are examples up to
10G/core and down to 100M/core, but not really beyond that. (Except
for stream processing, which is great stuff but _cries_ for
non-general-purpose HW.)

1GB per core is the standard that most supercomputers had eight or so
years ago. It's quite interesting to see how RAM and latency between
CPUs haven't kept pace with CPU crunching power.
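
(For what it's worth, a Linux box will tell you where it sits on that
GB-per-core scale with a couple of sysconf() calls; _SC_PHYS_PAGES and
_SC_NPROCESSORS_ONLN are glibc extensions, so this is Linux-ish, not
strictly portable:)

/* Report installed RAM per online core (Linux/glibc sysconf names). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pages     = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    long cores     = sysconf(_SC_NPROCESSORS_ONLN);

    if (pages < 0 || page_size < 0 || cores <= 0) {
        perror("sysconf");
        return 1;
    }
    double total_gb = (double)pages * page_size
                      / (1024.0 * 1024.0 * 1024.0);
    printf("%.1f GB total, %ld cores, %.2f GB/core\n",
           total_gb, cores, total_gb / cores);
    return 0;
}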
As data rates get higher, even really good bit error rates on the
wire get to be too big. Consider this: a BER of 1E-10 is quite good,
but if you're pumping 10Gb/s over the wire, that's an error every
second. (A BER of 1E-10 is a typical rate for something like a
100Mbps link...) So, practical systems ...
I'm no expert, but 1e-10 seems quite high to me. The docs I found
about 10G requirements all specified 1e-12, and claimed to have
achieved 1e-15 in realistic, long-range tests...
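
(The arithmetic is just line rate times BER; a quick check, with the
rates and BERs picked to match the figures mentioned above:)

/* Mean error rate = line rate (bits/s) * BER. The rate/BER pairs are
   the ones discussed above, not measured values. */
#include <stdio.h>

int main(void)
{
    const double rates_bps[] = { 100e6, 10e9,  10e9  };
    const double bers[]      = { 1e-10, 1e-10, 1e-12 };

    for (int i = 0; i < 3; i++) {
        double errs_per_sec = rates_bps[i] * bers[i];
        printf("%5.0f Mb/s at BER %.0e -> %.3g errors/s "
               "(one every %.0f s)\n",
               rates_bps[i] / 1e6, bers[i], errs_per_sec,
               1.0 / errs_per_sec);
    }
    return 0;
}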
In the end the conclusion will of course be that we very badly need a
newer process technology from ASML to get into production, so we can
produce CPUs with even more transistors and push some of the problems
into the 1GB L3 cache :)
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf