> I would suggest that some scheme of redundant computation might be more
> effective.. Rather than try to store a single node's state on the node,
> and then, if any node hiccups, restore the state (perhaps to a spare), and
> restart, means stopping the entire cluster while you recover.
>
> Or, if you can factor your computation to make use of extra processing
> nodes, you can just keep on moving. Think of this as a higher level
> scheme than, say, Hamming codes for memory protection: use 11 bits to
> store 8, and you're still synchronous.

One similar avenue I have thought about is what I call dynamic redundancy. It requires a top-level, divide-and-conquer-like approach in which independent "parts" can fail without causing the others to fail, because the assumption is that something will fail.

Depending on the resource load, you can dial up how much redundancy you want, so that some range of the "parts" runs redundantly; when one or more of them fail, the redundant copies take over. At one end of the dial everything is redundant and execution is slowest. At the other end nothing is redundant and execution is fastest. In between, you are betting that running every Nth part redundantly improves the odds that a failure lands on a "part" that has a redundant copy.

If you choose no redundancy, the program could end up waiting at communication points for the failed part to respawn, complete, and then continue at the exchange point. The worst case is a failure just before a "part's" completion. If a "part" fails halfway through its run, it is only half a run behind the others, and if everyone else is waiting, you can respawn the failed "part(s)" with redundancy to ensure they get done.

You could also have schemes where the grain size of the parallelism is used to adjust the redundancy. For example, if there are idle resources, why not use them for redundancy just in case? There are lots of interesting ways to keep things moving if you run in a dynamic fashion. Furthermore, I think an Erlang-like runtime system will be needed so that you can change code while the program is running.

In general, I find this to be an interesting exercise: design parallel codes that tolerate a range of messaging times, from almost instant to never.

--
Doug

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
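
As a footnote to the dynamic redundancy idea, here is a minimal sketch of the "dial" as a toy model. Everything in it is an assumption made for illustration: the names `redundant_parts` and `time_to_exchange` are hypothetical, a redundant copy is assumed never to fail, a failed non-redundant part is assumed to restart from scratch, and the extra cost of running copies is not modeled at all.

```python
import math


def redundant_parts(n_parts, dial):
    """Pick which parts get a redundant copy.

    dial = 0.0 means no redundancy, dial = 1.0 means every part is
    duplicated. In between, the duplicated parts are spread evenly,
    i.e. roughly every Nth part. (Hypothetical helper.)
    """
    if not 0.0 <= dial <= 1.0:
        raise ValueError("dial must be between 0.0 and 1.0")
    k = round(n_parts * dial)
    if k == 0:
        return set()
    return {math.floor(i * n_parts / k) for i in range(k)}


def time_to_exchange(redundant, failures, run_time=1.0):
    """Time at which every part has reached the exchange point.

    failures maps a part index to the fraction of its run that was
    complete when it failed. Assumptions: a redundant copy takes over
    with no delay, and a non-redundant part restarts from scratch.
    """
    finish = run_time
    for part, fraction_done in failures.items():
        if part in redundant:
            continue  # the copy finishes on schedule
        # Work done before the failure is lost, then a full rerun.
        finish = max(finish, fraction_done * run_time + run_time)
    return finish
```

With 8 parts and the dial at 0.5, parts 0, 2, 4 and 6 get copies. A failure on part 2 costs nothing in this model, while a failure halfway through part 3 leaves everyone waiting until 1.5 run times, which matches the "only halfway behind" case described above.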