Jim, your microcontroller cluster is not such a bad idea at all. Latency didn't keep up with CPU speeds...
Today's nodes have a CPU with 12 cores, soon 16, which can execute, to take a simple integer example from my chess program and its IPC, about 24 instructions per cycle in total. Nothing SIMD, just simple integer instructions for the most part, and of course loads, which effectively come from L1, play an overwhelming role there. The typical latency of a random memory read from a remote node, even with the latest networks, is between 0.85 and 1.9 microseconds for an RDMA read. Let's take an optimistic 1 microsecond. In that timeframe you can execute 24k+ instructions.

IPC on the cheapo CPUs is effectively far under 1, around 0.25 for most codes, so a 70 MHz CPU executes roughly one instruction every four clocks, about 17.5 million instructions a second. We are working with rough measures here; call the hop between two microcontrollers a quarter of a millisecond, since even USB 1.1 reaches its sticks with latencies far under a millisecond. So counted in instruction times, the actual latency of today's clusters is a factor of 25k, far worse relative to CPU speed than this 'cluster'. In fact, relative to its instruction rate, your microcontroller cluster has latencies here that you do not even get core to core within a single CPU today. (See the back-of-the-envelope sketch at the end of this mail.)

There is still too much 1980s and 1990s software out there, written by the guys who wrote the books on how to parallelize, and it simply doesn't scale at all on modern hardware. Let me not quote too many names there, as I've done before. They were just too lazy to throw away their old code and start over with a new parallel design that works on today's hardware.

If we involve GPUs, there is going to be an even bigger problem: the bandwidth of the network can't keep up with what a single GPU delivers. Who is to blame for that is quite a complicated discussion, if anyone has to be blamed at all. We just need more clever algorithms there.
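To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in C. All figures are the rough numbers quoted above, not measurements: the mail's "24 instructions per cycle" together with "24k+ instructions per microsecond" implies roughly 24e9 instructions/second over all cores, and the quarter-millisecond USB 1.1 hop is the same rough guess as in the text.

#include <stdio.h>

int main(void)
{
    /* Modern node: ~24e9 instructions/second in total, as implied
     * by "24 per cycle" and "24k+ per microsecond" above. */
    double node_ips = 24e9;
    double rdma_s   = 1e-6;          /* optimistic remote RDMA read */

    /* Microcontroller node: 70 MHz at ~0.25 IPC. */
    double mcu_ips  = 70e6 * 0.25;   /* ~17.5 million instr/second */
    double usb_s    = 0.25e-3;       /* USB 1.1 hop, ~1/4 millisecond */

    /* Instructions each machine could have retired while waiting. */
    printf("big node: %.0f instructions lost per remote read\n",
           rdma_s * node_ips);       /* ~24000 */
    printf("mcu node: %.0f instructions lost per USB hop\n",
           usb_s * mcu_ips);         /* ~4375 */

    /* Time the microcontroller needs for those same 24k instructions. */
    printf("mcu time for 24k instructions: %.2f ms\n",
           24000.0 / mcu_ips * 1e3); /* ~1.37 ms */
    return 0;
}

With these assumed figures the big node throws away ~24,000 instructions per remote read, while the microcontroller node throws away ~4,400 per hop. The exact ratio depends entirely on which USB latency you plug in, but relative to its own speed the microcontroller cluster clearly wastes less.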
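For the GPU point, the same kind of rough arithmetic applies. The numbers below are purely my own illustrative assumptions, not from anything Jim posted: a high-end GPU with on the order of 300 GB/s of memory bandwidth against a 56 Gbit/s FDR InfiniBand link, about 7 GB/s.

#include <stdio.h>

int main(void)
{
    /* Illustrative only: assumed GPU memory bandwidth vs. an
     * assumed 56 Gbit/s network link. Plug in your own hardware. */
    double gpu_gb_s = 300.0;        /* GB/s, assumed high-end GPU */
    double net_gb_s = 56.0 / 8.0;   /* GB/s, assumed FDR-class link */

    printf("one GPU outruns the network by a factor of %.0f\n",
           gpu_gb_s / net_gb_s);    /* ~43x with these assumptions */
    return 0;
}

So under these assumptions even a single GPU can consume or produce data some 40x faster than the wire can move it, which is why the clever-algorithms route, keeping data local and communicating less, is the one that scales.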