-----Original Message-----
From: Douglas Eadline [mailto:deadl...@eadline.org] 
Sent: Thursday, January 12, 2012 8:49 AM
To: Lux, Jim (337C)
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] A cluster of Arduinos

snip
>
>
> For my own work, I'd rather have people who are interested in solving 
> problems by ganging up multiple failure prone processors, rather than 
> centralizing it all in one monolithic box (even if the box happens to 
> have multiple cores).
>

This is going to be an exascale issue, i.e. how to compute on a system whose 
parts might be in a constant state of breaking. Another interesting question 
is: how do you know you are getting the right answer on a *really* large system?

Of course I spend much of my time optimizing really small systems.

--

Your point about scaling is well taken. So far, the computing world has 
largely dealt with this by trying to make the processor perfect and error-free; 
error correction is popular only in a few limited areas (e.g. ECC RAM). But 
think more broadly: say your arithmetic unit has some infrequent, unknown 
errors (like the FDIV bug on the Pentium). Could clever algorithm design and 
multiple processors (or multiple cores) mitigate this? For instance, instead of 
just computing Z = X/Y you also compute Z1 = (X*2)/(Y*2) and compare the 
answers. That exact example isn't great because you've added two operations, 
but I can see that other, cleverer techniques might be possible.
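
To make that concrete, here is a toy C sketch of the compare-two-divisions 
idea (the operand pair is the well-known FDIV test case; the helper name and 
the rounding tolerance are my own illustrative choices, nothing standard):

#include <float.h>
#include <math.h>
#include <stdio.h>

/* Compute x/y two algebraically identical ways and report whether
 * the results agree.  A fault like the FDIV bug may hit only one of
 * the two operand patterns, so a mismatch flags a detected error. */
static int checked_div(double x, double y, double *z)
{
    double z1 = x / y;                  /* primary computation      */
    double z2 = (x * 2.0) / (y * 2.0);  /* redundant "check" form   */
    *z = z1;
    return fabs(z1 - z2) <= 4.0 * DBL_EPSILON * fabs(z1);
}

int main(void)
{
    double z;
    if (!checked_div(4195835.0, 3145727.0, &z))  /* classic FDIV operands */
        fprintf(stderr, "divide check disagreed -- retry or vote\n");
    else
        printf("z = %.17g\n", z);
    return 0;
}

On a correctly working FPU the two quotients should match bit-for-bit, since 
doubling both operands is exact; the tolerance is just belt and suspenders.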

What would be nice is if you could do things like temporal redundancy (do the 
calculation twice, and if the results differ, do it a third time), or, even 
better, some sort of "check calculation" that takes a small amount of time 
compared to the mainline calculation.

This, I think, is somewhere that even the big iron/cluster folks could be doing 
some research. What are the optimum communication fabrics to support this kind 
of "side calculation", which may have different communication patterns and data 
flow than the "mainline"? It has a parallel in things like CRC checks in 
communications protocols: a lot of hardware has a dedicated little CRC checker 
that continuously calculates the CRC as the bits arrive, so that when you get 
to the end of the frame, the answer is already there.
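
For illustration, here is what that running check looks like in software: a 
bitwise CRC-32 register folded forward one byte at a time, so the answer is 
ready the moment the last byte lands (this is the standard reflected-polynomial 
algorithm, not anything specific to one protocol):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fold one arriving byte into a running CRC-32 (reflected polynomial
 * 0xEDB88320, the one Ethernet and zlib use). */
static uint32_t crc32_update(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

int main(void)
{
    const char *frame = "123456789";
    uint32_t crc = 0xFFFFFFFFu;                 /* standard initial value */
    for (size_t i = 0; i < strlen(frame); i++)  /* update as bytes "arrive" */
        crc = crc32_update(crc, (uint8_t)frame[i]);
    /* Standard check value for "123456789" is cbf43926. */
    printf("crc32 = %08x\n", (unsigned)(crc ^ 0xFFFFFFFFu));
    return 0;
}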


And Doug, your small systems have a lot of the same issues, perhaps because 
that small Limulus might be operated in environments other than the ones the 
underlying hardware was designed for. I know people who have been rudely 
surprised to find that the design environment for a laptop is a pretty narrow 
temperature range (e.g. an office desktop), and when they put one in a car, 
subject to 0C to 40C temperatures if not wider, things don't work quite as 
well as expected.

Very small systems (a few nodes) have the same issues in some environments, 
e.g. a cluster subject to single-event upsets or single-event functional 
interrupts in a high-radiation environment with a lot of high-energy charged 
particles. It's not so much a total-dose thing as an SEE thing.

For Juno (which is headed for a polar orbit around Jupiter), we shielded 
everything in a vault (a 1-meter cube with 1-cm-thick titanium walls), and 
it's still an issue. We don't get very long before everything is cooked.

And I think that with a non-trivially small cluster (more than 4 nodes, say) 
you could do a lot of experimentation on these techniques.


(oddly, simulated fault injection is one of the trickier parts)
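
One crude software approximation (the helper below is purely illustrative, not 
any particular tool's API) is to flip a random bit in a value with some small 
probability:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* With probability p_fault, flip one random bit of a double --
 * a crude software stand-in for a single-event upset. */
static double maybe_inject(double value, double p_fault)
{
    if ((double)rand() / RAND_MAX < p_fault) {
        uint64_t bits;
        memcpy(&bits, &value, sizeof bits);    /* type-pun without UB */
        bits ^= (uint64_t)1 << (rand() % 64);  /* the "upset"         */
        memcpy(&value, &bits, sizeof bits);
    }
    return value;
}

int main(void)
{
    srand(12345);  /* deterministic seed so experiments are repeatable */
    printf("%.17g\n", maybe_inject(1.0, 0.5));
    return 0;
}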
