2012/9/24 Justin YUAN SHI <s...@temple.edu>:
> I think the Redundant Memory paper was really mis-configured. It uses
> a storage solution, trying to solve a volatile memory problem but
> insisting on eliminating volatility. It looks very much messed up.

http://thebrainhouse.ch/gse/silvio/74.GSE/Silvio's%20Corner%20Doc%20Jukebox/System%20z%20Redundant%20Array%20of%20Independent%20Memory.pdf

Maybe this paper is better. It explains how RAIM is implemented in the
newish IBM System z.

> My early comment on the OSI model still stands, even though MPI
> implementation is far down the stack that may not fit the OSI model
> well. The MPI implementation, even at the transport layer does NOT
> re-transmit messages.

I don't think you can even begin to apply tech like InfiniBand or Fibre
Channel to the OSI model. TCP does not really fit the OSI model either.
The OSI model was part of a standards framework developed by some weird
ISO sub-group back in the mid 80s for an application stack that was never
used. People have since kinda munged together OSI, TCP and other
application stuff to make a horrific, stupid mess that should be
consigned to a history book. Ever heard of FTAM, X.400 or CMIP?

>When machine hangs running MPI protocol stack, the entire app hangs. this is
>the root cause for all our fault tolerance problems.

I'm pretty sure faulty hardware is the root cause of our fault tolerance
problems :). In any case, the main issue seems to be the loss of a chunk
of your application memory when the node fails, not so much the
retransmission of messages. MPI has some kind of functionality inside to
address fault tolerance anyway.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.7837&rep=rep1&type=pdf

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf