> > It's all about ultimate scalability.  Anybody with a moderate competence
> > (certainly anyone on this list) could devise a scheme to use 1000 perfect
> > processors that never fail to do 1000 quanta of work in unit time.  It's
> > substantially more challenging to devise a scheme to do 1000 quanta of
> > work in unit time on, say, 1500 processors with a 20% failure rate.  Or
> > even in 1.2*unit time.
> 
> Just to be clear - I wasn't saying this was a bad idea. Scaling up to
> this size seems inevitable. I was just imagining the team of admins who
> would have to be working non-stop to replace dead processors!
> 
> I wonder what the architecture for this system will be like. I imagine
> it will be built around small multi-socket blades that are hot-swappable
> to handle this.

I think you just anticipate the failures and deal with them in software.
It's challenging to write code that does this, but it's certainly a worthy
objective.  I can easily see a situation where the cost of replacing dead
units is so high that you just don't bother: it's cheaper to add more live
ones to the "pool".
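
To make that concrete, here is a rough Python sketch of what "anticipate
the failures and deal with them" might look like: a work pool that
re-queues a quantum whenever the processor running it dies, and never
replaces the dead ones.  The 1000 quanta / 1500 processors / 20% numbers
come straight from the quoted example; reading "20% failure rate" as a
20% chance that a processor dies during any given quantum is my own
assumption, and run_pool is just an illustrative name.

    import random
    from collections import deque

    def run_pool(num_workers, num_tasks, failure_rate, seed=0):
        """Failure-tolerant pool: a quantum whose worker dies is
        re-queued; dead workers are dropped, never replaced."""
        rng = random.Random(seed)
        pending = deque(range(num_tasks))   # quanta still to be done
        alive = num_workers
        rounds = attempts = 0
        while pending:
            if alive == 0:
                raise RuntimeError("pool exhausted: every worker died")
            rounds += 1
            # One scheduling round: each live worker takes at most one
            # quantum off the queue.
            for _ in range(min(alive, len(pending))):
                task = pending.popleft()
                attempts += 1
                if rng.random() < failure_rate:
                    alive -= 1            # worker died mid-quantum...
                    pending.append(task)  # ...re-queue its quantum
        return rounds, attempts, alive

    rounds, attempts, alive = run_pool(num_workers=1500,
                                       num_tasks=1000,
                                       failure_rate=0.20)
    print(f"{rounds} rounds, {attempts} attempts "
          f"(expect ~1000/0.8 = 1250), {alive} of 1500 still alive")

With those numbers the pool loses roughly 250 processors along the way
but still drains the queue in a handful of scheduling rounds, which is
the whole point: past a certain scale, absorbing failures is the
scheduler's job, not the admins'.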