> > It's all about ultimate scalability. Anybody with a moderate competence
> > (certainly anyone on this list) could devise a scheme to use 1000 perfect
> > processors that never fail to do 1000 quanta of work in unit time. It's
> > substantially more challenging to devise a scheme to do 1000 quanta of
> > work in unit time on, say, 1500 processors with a 20% failure rate. Or
> > even in 1.2*unit time.
>
> Just to be clear - I wasn't saying this was a bad idea. Scaling up to this
> size seems inevitable. I was just imagining the team of admins who would
> have to be working non-stop to replace dead processors!
>
> I wonder what the architecture for this system will be like. I imagine it
> will be built around small multi-socket blades that are hot-swappable to
> handle this.
I think you just anticipate the failures and deal with them. It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost of replacing dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool".
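
As a rough, minimal sketch (mine, not anything from the thread) of what "anticipate the failures" can look like in code: a master holds a queue of work quanta, hands them out to whichever workers are alive, and simply re-queues any quantum whose worker died rather than waiting for the node to be repaired. The counts and failure rate below just echo the 1500-processor / 20% example quoted above; the round-based hand-out and all names are illustrative.

    import random
    from collections import deque

    QUANTA = 1000          # units of work that must complete
    WORKERS = 1500         # processors in the pool
    FAILURE_RATE = 0.20    # chance a worker dies while holding a quantum

    def run():
        pending = deque(range(QUANTA))   # work not yet completed successfully
        done = 0
        rounds = 0
        while pending:
            rounds += 1
            # Hand out at most one quantum per worker this round.
            batch = [pending.popleft()
                     for _ in range(min(WORKERS, len(pending)))]
            for quantum in batch:
                if random.random() < FAILURE_RATE:
                    pending.append(quantum)   # worker died: re-queue the work
                else:
                    done += 1
        print("finished %d quanta in %d round(s)" % (done, rounds))

    if __name__ == "__main__":
        run()

Most quanta finish in the first round and the stragglers get retried, so the batch typically completes within a handful of rounds; with 500 spare processors you could also dispatch some quanta redundantly to shave that further. Either way, the failures are budgeted for up front instead of being raced against with screwdrivers.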