Quoting Mark Hahn <[EMAIL PROTECTED]>, on Thu 28 Feb 2008 07:33:07 AM PST:

The problem with many (cores|threads) is the memory bandwidth wall: a fixed-size pipe (B) to memory, with N requesters on that pipe ...

What wall? Bandwidth is easy; it just costs money, and not much at that. Want 50GB/sec[1]? Buy a $170 video card. Want 100GB/sec... buy a

Heh... if it were that easy, we would spend extra on more bandwidth for
Harpertown and Barcelona ...
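
To put rough numbers on that shared-pipe argument, here's a back-of-the-envelope sketch in Python (the per-socket bandwidth figure is purely an illustrative assumption, not the spec of any particular part):

# Back-of-the-envelope: a fixed memory pipe shared by N cores.
SOCKET_BANDWIDTH_GB_S = 25.6   # assumed total memory bandwidth per socket

for cores in (2, 4, 8, 64, 1024):
    per_core = SOCKET_BANDWIDTH_GB_S / cores
    print(f"{cores:5d} cores sharing the pipe -> {per_core:8.3f} GB/s each")

However you pick the numbers, the per-core share shrinks linearly with the core count unless the pipe grows with it.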

I think the point is that chip vendors are not talking about a mere
doubling of the number of cores, but (apparently with straight faces)
about things like 1k GP cores/chip.

Personally, I think they're in for a surprise: there isn't a vast
market for more than 2-4 cores per chip.

Perhaps not today. But then, Thomas Watson said there wasn't a vast market for computers... perhaps five, worldwide.

No question that folks will have to figure out how to effectively use all that parallelism (e.g., each processor deals with one page of a Word document, or a range of Excel cells?). I can see a lot of fairly easily coded things dealing with rapid search (e.g., which of my documents contain the words "hyperthreading" and "Hahn"). Right now, search and retrieval of unstructured data is a very computationally intensive task that millions of folks suffer through daily. (How many of you find Google over the web faster than Microsoft's "Search for File or Folder...", or grepping the entire disk, on your local machine?)
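
As a toy sketch of that kind of easily parallelized search (the directory, file types, and search terms here are made up for illustration; real desktop search tools build indexes rather than brute-force scanning):

from multiprocessing import Pool
from pathlib import Path

TERMS = ("hyperthreading", "hahn")   # both words must appear (case-insensitive)

def matches(path):
    # Return the path if the file contains every search term, else None.
    try:
        text = Path(path).read_text(errors="ignore").lower()
    except OSError:
        return None
    return path if all(term in text for term in TERMS) else None

if __name__ == "__main__":
    # Hypothetical corpus: every .txt file under ./docs
    files = [str(p) for p in Path("docs").rglob("*.txt")]
    with Pool() as pool:               # one worker process per available core
        hits = [p for p in pool.map(matches, files) if p]
    print("\n".join(hits))

Each file can be scanned independently, so the work fans out across however many cores you have.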


And we cluster dweebs have a head start on them... we've been figuring out how to spread problems that are too big to fit on one node across multiple nodes for years now. After all, billg's programming fame is from a flood-fill graphics algorithm, and look how well he's done with that <grin>.



limits, and no programming technique is going to get you around that
limit per socket. You need to change your programming technique to go
many-socket. That limit is the bandwidth wall.

IMO, this is the main fallacy behind the current industry harangue.
The problem is _NOT_ that programmers are dragging their feet, but
rather some combination of Amdahl's law and the low average _inherent_
parallelism of computation. (I'm _not_ talking about MC or graphics
rendering here, but today's most common computer uses: web and email.)

Text search and retrieval is where it's at. Almost 30 years ago I worked on developing a piece of office equipment, the size of a two-drawer file cabinet, that would do just that, hooked up to a bunch of word processors (i.e., "find me that letter we sent to John Smith"). It was expensive! It had an 80MB (or 160MB) disk drive (huge!), and it could search thousands of pages in the blink of an eye. (It was called the OFISfile, sold by Burroughs.) And people DID buy it. And, without giving away the internals, it could have made excellent use of a 1000-core type of processor.

Granted, the Googles of the world will (correctly) contend that an equally good solution is a good comm link to a centralized search and retrieval engine (it doesn't even have to be that fast... just comparable to the time it takes me to enter the request and read the results). But they, too, can use parallelism.




The manycore cart is being put before the horse. Worse, no one has really
shown that manycore (and the presumed ccNUMA model) is actually
scalable to large core counts on "normal" workloads. (Getting good scaling
for an AM CFD code on 128 cores in an Altix is kind of a different proposition
than scaling to 128 cores in a single chip.)

To a certain extent it's an example of "build it and they will come" (for 10% of the things that are built; the other 90% are interesting blips left by the side of the road).

When compilers were introduced, I'm sure the skilled machine language coders said... hmmph, we can do just fine with our octal and hex, there's no expressed demand for high-level languages. (Kids... get offa my lawn!) Heck, the plugboard programmers on EAM equipment probably said the same thing to the guys working with stored-program computers. And before that, the supervisor of the computer pool probably said it to the plugboard guys, as he gazed over a room full of Marchant calculators with human computers punching in numbers and pulling the handles.



What's missing is a reason to think that basically all workloads can be made
cache-friendly enough to scale to 10^2 or 10^3 cores. I just don't see that.

Not all workloads... just enough to form a significant market. And text search and retrieval is a pretty big consumer of CPU cycles in the big wide world (as opposed to the specialized world of large numeric simulations and the like that have historically been hosted on clusters).

Remember, the recurring cost is basically related to the size of the die, not what's on it. So if there's a significant market for 10,000-processor widgets, they'll be made, and cheaply.

As data rates get higher, even really good bit error rates on the wire get to be too big. Consider this... a BER of 1E-10 is quite good, but if you're pumping 10Gb/s over the wire, that's an error every second. (A BER of 1E-10 is a typical rate for something like a 100Mbps link...) So, practical systems
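
A quick sanity check on that arithmetic (using only the link speeds and BERs mentioned above):

# Expected raw bit errors per second = bit error rate * bit rate.
def errors_per_second(ber, bits_per_second):
    return ber * bits_per_second

for label, rate in (("100 Mbps", 1e8), ("10 Gbps", 1e10)):
    for ber in (1e-10, 1e-12):
        print(f"{label} at BER {ber:.0e}: {errors_per_second(ber, rate):.2g} errors/sec")

At 10Gb/s and 1E-10 that works out to one error per second, as stated.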

I'm no expert, but 1e-10 seems quite high to me. The docs I found about 10G
requirements all specified 1e-12, and claimed to have achieved 1e-15 in
realistic, long-range tests...

That's probably the error rate above the PHY layer, i.e., after the forward error correction. And the 10G requirement is tighter than the 100Mbps requirement, just to keep FEC workable with reasonable redundancy. Typically, you want the raw PHY BER to be at least 100x below the reciprocal of the bit rate (e.g., 1E8 bps -> 1E-10 BER, 1E10 bps -> 1E-12 BER).
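
A minimal sketch of that rule of thumb, assuming the reading above (raw BER at least a factor of 100 below the reciprocal of the bit rate):

# Rule-of-thumb target: raw PHY BER <= 1 / (margin * bit_rate),
# i.e., no more than roughly one raw bit error per `margin` seconds.
def target_ber(bits_per_second, margin=100):
    return 1.0 / (margin * bits_per_second)

for label, rate in (("100 Mbps", 1e8), ("10 Gbps", 1e10)):
    print(f"{label}: target raw PHY BER <= {target_ber(rate):.0e}")

That reproduces the two examples: 1E-10 for a 1E8 bps link and 1E-12 for a 1E10 bps link.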



_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
