Quoting Joe Landman <[EMAIL PROTECTED]>, on Thu 28 Feb 2008 05:20:01 AM PST:

Bill Broadley wrote:
The problem with many (cores|threads) is that memory bandwidth wall. A fixed size (B) pipe to memory, with N requesters on that pipe ...

What wall? Bandwidth is easy; it just costs money, and not much at that. Want 50GB/sec[1]? Buy a $170 video card. Want 100GB/sec... buy a

Heh... if it were that easy, we would spend extra on more bandwidth for
Harpertown and Barcelona ...

The point is that the design determines your hard, fixed per-socket limits, and no programming technique is going to get you around that limit per socket. You need to change your programming technique to go many-socket. That limit is the bandwidth wall.
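As a back-of-the-envelope sketch of that per-socket wall (the 20 GB/s socket bandwidth below is an invented number, not a measurement of any particular chip), the shared memory pipe divides across the cores even in the best case:

    # Toy illustration: a fixed per-socket memory pipe shared by N cores.
    # The 20 GB/s figure is hypothetical, not a measured value.
    SOCKET_BW_GB_S = 20.0
    for cores in (1, 2, 4, 8, 16):
        print(f"{cores:2d} cores -> {SOCKET_BW_GB_S / cores:5.2f} GB/s per core")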


And this is much the same as the earlier discussions on this list, when folks were building 8- and 16-processor clusters. There, the bandwidth wall was the 10 Mbps Ethernet interconnect, first through a hub, then a switch, etc.

This is sort of why any programming technique for speedup that relies on tight coupling (e.g., shared memory) can't scale infinitely. At some point, the speed of light and physical size conspire to do you in.
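To put a rough number on the speed-of-light part (the propagation speed and cable length below are assumptions for illustration only):

    # Physics floor on node-to-node round-trip time, independent of link speed.
    # Assumes signals propagate at roughly 2/3 of c; the 15 m run is hypothetical.
    C = 3.0e8                 # speed of light in vacuum, m/s
    v = 0.66 * C              # rough propagation speed in copper/fiber, m/s
    distance_m = 15.0         # one-way cable length between nodes
    rtt_ns = 2 * distance_m / v * 1e9
    print(f"Minimum RTT over {distance_m} m: {rtt_ns:.0f} ns")   # about 150 ns

No amount of link bandwidth or protocol cleverness gets you under that floor; only shrinking the machine does.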

If one wanted to design revolutionary distributed/parallel computing algorithms, one could probably work with floppy disks and sneakernet. If it works there, it will certainly work on any faster mechanism. See, true computer science doesn't need a 1000-processor cluster.

Another cluster-related computer science issue is to start dealing with unreliable links between the nodes of the cluster. The overwhelming majority of cluster codes assume that message passing is perfect and has no errors. Sometimes this is provided transparently by the communications mechanism (e.g., TCP promises in-order, error-free delivery). However, in the TCP case that comes at a cost: the latency isn't constant (because it achieves reliability through temporal redundancy, i.e., retries), and if your algorithm does some sort of scatter/gather and needs barrier synchronization, a late packet on one link brings the whole mass to a halt.
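A toy model of that straggler effect (the base latency, retry probability, and 200 ms penalty below are invented for illustration, not measured TCP behavior):

    # Toy model: a barrier completes only when the slowest message arrives,
    # so a single retransmission delays every node. All numbers are illustrative.
    import random

    def barrier_time_us(n_nodes, base_us=50.0, retry_prob=1e-3, retry_penalty_us=2e5):
        """Barrier completion time (us) = max over per-node message latencies."""
        return max(base_us + (retry_penalty_us if random.random() < retry_prob else 0.0)
                   for _ in range(n_nodes))

    random.seed(1)
    for n in (16, 256, 1024):
        trials = [barrier_time_us(n) for _ in range(1000)]
        print(f"{n:5d} nodes: mean barrier time {sum(trials)/len(trials):10.1f} us")

The more nodes waiting at the barrier, the more likely at least one of them is stuck behind a retry, so the mean completion time grows with the cluster size even though each individual link is "usually" fast.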

As data rates get higher, even really good bit error rates on the wire produce too many errors. Consider this: a BER of 1E-10 is quite good, but if you're pumping 10 Gb/s over the wire, that's an error every second. (A BER of 1E-10 is a typical rate for something like a 100 Mbps link.) So practical systems use some sort of FEC, but even with that, BERs of 1E-14 or 1E-15 are pretty much state of the art over shortish (meters) distances. (It's a power/signal-to-noise-ratio thing: how much energy can you put into sending one bit of information?)
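The arithmetic behind those numbers (using the line rates and BERs given above):

    # Expected bit errors per second = line rate (bits/s) * bit error rate.
    def errors_per_second(line_rate_bps, ber):
        return line_rate_bps * ber

    print(errors_per_second(10e9, 1e-10))    # 10 Gb/s at 1e-10 -> 1.0 error/s
    print(errors_per_second(100e6, 1e-10))   # 100 Mb/s at 1e-10 -> 0.01/s, ~100 s between errors
    print(errors_per_second(10e9, 1e-15))    # 10 Gb/s at 1e-15 -> 1e-5/s, ~28 hours between errors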

Jim

