At the same time, there are APIs (e.g. HTCondor) that do not assume successful communication or computation; they are used in large distributed computing projects such as SETI@home, Folding@home, and distributed.net (though I don't think the latter has a toolbox available). For embarrassingly parallel workloads they can be a good match; for tightly coupled workloads, not always.
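Roughly, the pattern those systems rely on looks like the sketch below: hand out independent tasks, assume nothing about whether any single attempt succeeds, and simply resubmit whatever fails or never reports back. This is plain C with a stubbed-out task runner, purely illustrative; it is not HTCondor's actual interface.

#include <stdio.h>
#include <stdlib.h>

#define NTASKS       16
#define MAX_ATTEMPTS 5

/* Stand-in for handing a task to a remote worker; in a real system this
 * would be a job submission that may fail or never report back. */
static int run_task(int id)
{
    (void)id;
    return rand() % 4 != 0;   /* roughly a quarter of attempts "fail" */
}

int main(void)
{
    int done[NTASKS] = {0};
    int remaining = NTASKS;

    /* Keep resubmitting until every task has reported success;
     * no single attempt is assumed to succeed. */
    for (int attempt = 0; attempt < MAX_ATTEMPTS && remaining > 0; attempt++) {
        for (int i = 0; i < NTASKS; i++) {
            if (!done[i] && run_task(i)) {
                done[i] = 1;
                remaining--;
            }
        }
        printf("after attempt %d: %d task(s) remaining\n", attempt + 1, remaining);
    }
    return remaining == 0 ? 0 : 1;
}

For embarrassingly parallel work the only cost of a lost task is redoing it, which is why this model scales so gracefully; tightly coupled workloads, where tasks must exchange data mid-run, are harder to fit into it.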
Luc

On 11/23/2012 5:19 AM, Justin YUAN SHI wrote:
> The fundamental problem rests in our programming API. If you look at MPI and OpenMP carefully, you will find that these and all others have one common assumption: the application-level communication is always successful.
>
> We knew full well that this cannot be true.
>
> Thus, the bigger the application we build, the higher the probability of failure. This should not be a surprise.
>
> Proposed fault tolerance methods, such as redundant execution, are really like "borrowing from John to pay Paul", where both John and Paul are personal friends.
>
> What we need is a truly sustainable solution that can gain performance and reliability at the same time as we scale up the application.
>
> This is NOT an impossible dream. The packet-switching network is a living example of such an architecture. The missing piece in HPC applications is the principle of statistic multiplexed computing. In other words, the application architecture should be considered as a whole in the design space, not a piece "glued" together from lower layers with unsealed semantic "holes". The semantic "holes" between the layers are the real evils behind all our troubles.
>
> Our research exhibit (booth 3360) demonstrates a prototype data-parallel system using this idea. The Sustainable HPC Cloud Workshop at the end of SC12 (Friday AM) had one paper touching on this topic as well.
>
> Justin
>
> On Thu, Nov 22, 2012 at 5:03 AM, Eugen Leitl <eu...@leitl.org> wrote:
>>
>> http://www.computerworld.com.au/article/442703/supercomputers_face_growing_resilience_problems/
>>
>> Supercomputers face growing resilience problems
>>
>> As the number of components in large supercomputers grows, so does the possibility of component failure
>>
>> Joab Jackson (IDG News Service)
>> 21 November, 2012 21:58
>>
>> As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. A few researchers at the recent SC12 conference, held last week in Salt Lake City, offered possible solutions to this growing problem.
>>
>> Today's high-performance computing (HPC) systems can have 100,000 nodes or more -- with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12.
>>
>> The problem is not a new one, of course. When Lawrence Livermore National Laboratory's 600-node ASCI (Accelerated Strategic Computing Initiative) White supercomputer went online in 2001, it had a mean time between failures (MTBF) of only five hours, thanks in part to component failures. Later tuning efforts had improved ASCI White's MTBF to 55 hours, Fiala said.
>>
>> But as the number of supercomputer nodes grows, so will the problem. "Something has to be done about this. It will get worse as we move to exascale," Fiala said, referring to how supercomputers of the next decade are expected to have 10 times the computational power that today's models do.
>>
>> Today's techniques for dealing with system failure may not scale very well, Fiala said. He cited checkpointing, in which a running program is temporarily halted and its state is saved to disk. Should the program then crash, the system is able to restart the job from the last checkpoint.
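(For readers who have never written one: at the application level, the checkpoint/restart pattern Fiala refers to can be as simple as the plain-C sketch below. The file name, state layout and interval are made up for illustration; real codes persist far larger distributed state to a shared file system, which is where the overhead discussed next comes from.)

#include <stdio.h>

#define N_STEPS    1000000L
#define CKPT_EVERY 10000L
#define CKPT_FILE  "state.ckpt"

int main(void)
{
    long step = 0;
    double x = 0.0;              /* the state we want to survive a crash */
    FILE *f;

    /* On startup, resume from the last checkpoint if one exists. */
    if ((f = fopen(CKPT_FILE, "rb")) != NULL) {
        if (fread(&step, sizeof step, 1, f) != 1 ||
            fread(&x, sizeof x, 1, f) != 1) {
            step = 0;            /* unreadable checkpoint: start over */
            x = 0.0;
        }
        fclose(f);
    }

    for (; step < N_STEPS; step++) {
        /* Checkpoint before doing this step's work, so a restart never
         * repeats work already included in the saved state. (A real
         * implementation would write to a temporary file and rename it,
         * so a crash mid-write cannot corrupt the checkpoint.) */
        if (step % CKPT_EVERY == 0 && (f = fopen(CKPT_FILE, "wb")) != NULL) {
            fwrite(&step, sizeof step, 1, f);
            fwrite(&x, sizeof x, 1, f);
            fclose(f);
        }

        x += 1.0 / (double)(step + 1);   /* stand-in for the real work */
    }

    printf("result: %.6f\n", x);
    return 0;
}

After a crash, rerunning the program loses at most CKPT_EVERY steps of work; the price is that every process must periodically stop and write its state, which is exactly the overhead described below.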
>> The problem with checkpointing, according to Fiala, is that as the number of nodes grows, the amount of system overhead needed to do checkpointing grows as well -- and grows at an exponential rate. On a 100,000-node supercomputer, for example, only about 35 percent of the activity will be involved in conducting work. The rest will be taken up by checkpointing and -- should a system fail -- recovery operations, Fiala estimated.
>>
>> Because of all the additional hardware needed for exascale systems, which could be built from a million or more components, system reliability will have to be improved by 100 times in order to keep to the same MTBF that today's supercomputers enjoy, Fiala said.
>>
>> Fiala presented technology that he and fellow researchers developed that may help improve reliability. The technology addresses the problem of silent data corruption, when systems make undetected errors writing data to disk.
>>
>> Basically, the researchers' approach consists of running multiple copies, or "clones", of a program simultaneously and then comparing the answers. The software, called RedMPI, is run in conjunction with the Message Passing Interface (MPI), a library for splitting running applications across multiple servers so the different parts of the program can be executed in parallel.
>>
>> RedMPI intercepts and copies every MPI message that an application sends, and sends copies of the message to the clone (or clones) of the program. If different clones calculate different answers, then the numbers can be recalculated on the fly, which will save time and resources compared with running the entire program again.
>>
>> "Implementing redundancy is not expensive. It may be high in the number of core counts that are needed, but it avoids the need for rewrites with checkpoint restarts," Fiala said. "The alternative is, of course, to simply rerun jobs until you think you have the right answer."
>>
>> Fiala recommended running two backup copies of each program, for triple redundancy. Though running multiple copies of a program would initially take up more resources, over time it may actually be more efficient, because programs would not need to be rerun to check answers. Also, checkpointing may not be needed when multiple copies are run, which would also save on system resources.
>>
>> "I think the idea of doing redundancy is actually a great idea. [For] very large computations, involving hundreds of thousands of nodes, there certainly is a chance that errors will creep in," said Ethan Miller, a computer science professor at the University of California Santa Cruz, who attended the presentation. But he said the approach may not be suitable given the amount of network traffic that such redundancy might create. He suggested running all the applications on the same set of nodes, which could minimize internode traffic.
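(To make the clone-and-compare idea concrete: RedMPI does the comparison per intercepted MPI message, but the underlying principle of triple redundancy with a vote can be sketched in a few lines of plain C. The injected fault and the final-answer comparison are illustrative only; this is not how RedMPI is implemented.)

#include <stdio.h>

/* The computation we want to protect against silent data corruption. */
static double compute(void)
{
    double sum = 0.0;
    for (int i = 1; i <= 1000; i++)
        sum += 1.0 / i;
    return sum;
}

int main(void)
{
    /* Run three "clones" of the computation. Here one replica is
     * deliberately corrupted to stand in for a silent error. */
    double r[3];
    r[0] = compute();
    r[1] = compute() + 1e-9;   /* injected fault for illustration */
    r[2] = compute();

    /* Majority vote: any replica that agrees with another is trusted. */
    for (int i = 0; i < 3; i++) {
        for (int j = i + 1; j < 3; j++) {
            if (r[i] == r[j]) {
                printf("agreed result: %.15f\n", r[i]);
                return 0;
            }
        }
    }

    printf("no two replicas agree; recompute needed\n");
    return 1;
}

Comparing per message rather than per final answer is what lets RedMPI catch a disagreement and recalculate on the fly instead of only at the end of the run.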
>> In another presentation, Ana Gainaru, a Ph.D. student from the University of Illinois at Urbana-Champaign, presented a technique of analyzing log files to predict when system failures would occur.
>>
>> The work combines signal analysis with data mining. Signal analysis is used to characterize normal behavior, so when a failure occurs, it can be easily spotted. Data mining looks for correlations between separate reported failures.
>>
>> Other researchers have shown that multiple failures are sometimes correlated with each other, because a failure with one technology may affect performance in others, according to Gainaru. For instance, when a network card fails, it will soon hobble other system processes that rely on network communication.
>>
>> The researchers found that 70 percent of correlated failures provide a window of opportunity of more than 10 seconds. In other words, when the first sign of a failure has been detected, the system may have up to 10 seconds to save its work, or move the work to another node, before a more critical failure occurs. "Failure prediction can be merged with other fault-tolerance techniques," Gainaru said.
>>
>> Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is joab_jack...@idg.com

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf