Hi, I am not sure I can really help answer the original question, but most of the failures we see from the system side are, in roughly this order:
- hard disks
- interconnect cards
- misconfigured nodes
- uncorrected memory errors
- system board failures
- unexplainable failures

Failures related to the application itself we do not see, because users simply resubmit their jobs and quietly correct their mistakes.

Clusters, by definition, are not highly available systems: they are built from commodity hardware, and since most of them run a standard MPI implementation, they operate on a fail-stop principle. Most of the time, failure investigation is minimal, because the priority is getting the node back into service. So is failure rate really a concern? If it were, we would see more fault-tolerance layers in clusters, and failure-rate metrics in monitoring tools and reports.

I am interested in reducing these failure rates, though, because user demands are growing: instead of using a few nodes, users now take as many as possible and ask for even more, and the more nodes you give them, the more failures you will see. (If each node fails independently with probability p during a job's runtime, an n-node job hits at least one failure with probability 1 - (1 - p)^n, which climbs quickly with n.)

What will you be trying to achieve with your thesis? Will the question of how to reduce or manage these failures be part of it?

regards,
Walid

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf