>
> -----Original Message-----
> From: Douglas Eadline [mailto:deadl...@eadline.org]
> Sent: Thursday, January 12, 2012 8:49 AM
> To: Lux, Jim (337C)
> Cc: beowulf@beowulf.org
> Subject: Re: [Beowulf] A cluster of Arduinos
>
> snip
>>
>> For my own work, I'd rather have people who are interested in solving
>> problems by ganging up multiple failure prone processors, rather than
>> centralizing it all in one monolithic box (even if the box happens to
>> have multiple cores).
>>
>
> This is going to be an exascale issue, i.e. how to compute on a system
> whose parts might be in a constant state of breaking. Another interesting
> question is how do you know you are getting the right answer on a *really*
> large system?
>
> Of course I spend much of my time optimizing really small systems.
>
> --
>
> Your point about scaling is well taken.. so far, the computing world has
> largely dealt with things by trying to make the processor perfect and
> error free. Some limited areas of error correction are popular (RAM).
> But think in a bigger area... say your arithmetic unit has some infrequent
> unknown errors (e.g. FDIV bug on Pentium).. could clever algorithm design
> and multiple processors (or multi cores) mitigate this (e.g. instead of
> just computing Z = X/Y you also compute Z1 = (X*2)/(Y*2).. and compare
> answers... that exact example's not great because you've added 2
> operations, but I can see that there are other clever techniques that
> might be possible.. )
>
> What is nice is if you can do things like temporal redundancy (do the
> calculation twice, and if it's different, do it a third time), or even
> better some sort of "check calculation" that takes small time compared to
> mainline calculation.
>
> This, I think, is somewhere that even the big iron/cluster folks could be
> doing some research. What are optimum communication fabrics to support
> this kind of "side calculation" which may have different communication
> patterns and data flow than the "mainline". It has a parallel in things
> like CRC checks in communications protocols. A lot of hardware has a
> dedicated little CRC checker that is continuously calculating the CRC as
> the bits arrive, so that when you get to the end of the frame, the answer
> is already there.
>
>
> And Doug, your small systems have a lot of the same issues, perhaps
> because that small Limulus might be operated in environments other than
> what the underlying hardware was designed for. I know people who have
> been rudely surprised when they found that the design environment for a
> laptop is a pretty narrow temperature range (e.g. office desktop) and when
> they put them in a car, subject to 0C or 40C temperatures, if not wider,
> that things don't work quite as well as expected.

I will be curious to see where these things show up, since all you
really need is a power plug (a little nervous, actually).

>
> Very small systems (few nodes) have the same issues, in some environments
> (e.g. a cluster subject to single event upsets or functional interrupts in
> a high radiation environment with a lot of high energy charged particles.
> it's not so much a total dose thing, but a SEE thing)
>
> For Juno (which is in polar orbit around Jupiter), we shielded everything
> in a vault (a 1 meter cube with 1cm thick titanium walls) and still it's
> an issue. We don't get very long before everything is cooked.
>
> And I think that a non-trivially small cluster (e.g. more than 4 nodes, I
> think) you could do a lot of experimentation on techniques.

I agree. Four nodes is really small. BTW, the most fun part of designing
this system is working within a set of constraints that are tighter than
those found on the typical cluster: noise, power, space, cabling, low
cost packaging, etc. I have been asked about a rack mount version,
we'll see.

One thing I find interesting is the core/node efficiency (what I call
"effective cores"). In general, *on some codes*, I found that fewer
cores (1P micro-ATX, 4 cores) are more efficient than many cores
(2P server, 12 cores). Seems obvious, but I like to test things.

>
>
> (oddly, simulated fault injection is one of the trickier parts)
>

I would assume so, because in a sense the black swan* is by definition
hard to predict.

(* the book by Nick Taleb, not the movie)

--
Doug
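
A rough sketch of the two ideas quoted above from Jim's message: temporal redundancy (run the calculation twice, and a third time only if the first two disagree) and a cheap "check calculation" for a divide. All names here (run_with_temporal_redundancy, checked_divide, make_flaky_divide) are made up for illustration, and the fault injector is a toy that simply corrupts chosen calls; it is not a model of real hardware faults.

```python
import math
from typing import Callable, Set, TypeVar

T = TypeVar("T")


def run_with_temporal_redundancy(compute: Callable[[], T],
                                 agree: Callable[[T, T], bool]) -> T:
    """Run compute twice; if the results differ, run a third time and
    return whichever earlier result the third run agrees with."""
    first = compute()
    second = compute()
    if agree(first, second):
        return first
    third = compute()  # tie-breaker, only paid for on disagreement
    if agree(third, first):
        return first
    if agree(third, second):
        return second
    raise RuntimeError("no two of the three runs agree")


def checked_divide(x: float, y: float, rel_tol: float = 1e-12) -> float:
    """Divide, then do a cheap check: multiplying back should recover x."""
    z = x / y
    if not math.isclose(z * y, x, rel_tol=rel_tol):
        raise ArithmeticError("check calculation failed for x / y")
    return z


def make_flaky_divide(x: float, y: float,
                      bad_calls: Set[int]) -> Callable[[], float]:
    """Hypothetical fault injector: returns a divide that gives a wrong
    answer on the call numbers listed in bad_calls (1-based)."""
    calls = {"n": 0}

    def divide() -> float:
        calls["n"] += 1
        result = x / y
        if calls["n"] in bad_calls:
            result += 1.0  # simulated transient fault
        return result

    return divide


if __name__ == "__main__":
    flaky = make_flaky_divide(1.0, 4.0, bad_calls={2})
    print(run_with_temporal_redundancy(flaky, math.isclose))
    print(checked_divide(1.0, 4.0))
```

Note the obvious weakness: if the same fault hits both of the first two runs, they agree with each other and the wrong answer passes. The multiply-back check is only an example of a check that is cheaper than repeating the mainline work; whether such a check exists depends on the calculation.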