Thanks for the kind words and comments! Good catch with HPL. It's definitely part of the test regime. I typically run 3 tests for consistency:
- Separate instance of STREAM2 on each node
- Separate instance of HPL on each node
- Simple MPI latency / bandwidth test called mpisweep that tests every link (I'll put this up on GitHub later as well)

I've now made the changes to the document. After this set of tests I'm not completely sure if NPB will add any further information. Those three benchmarks, combined with the other checks, should expose pretty much all the possible issues. However, I could be missing something again :)

Best regards,
O-P

--
Olli-Pekka Lehto
Development Manager
Computing Platforms
CSC - IT Center for Science Ltd.
E-Mail: olli-pekka.le...@csc.fi
Tel: +358 50 381 8604
skype: oplehto // twitter: ople

> From: "Jeffrey Layton" <layto...@gmail.com>
> To: "Olli-Pekka Lehto" <olli-pekka.le...@csc.fi>
> Cc: beowulf@beowulf.org
> Sent: Tuesday, 22 March, 2016 16:45:20
> Subject: Re: [Beowulf] Cluster consistency checks
>
> Olli-Pekka,
>
> Very nice - I'm glad you put a list down. Many of the things that I do are
> based on experience.
>
> A long time ago, in one of my previous jobs, we used to run the NAS Parallel
> Benchmark (NPB) on single nodes to get a baseline of performance. We would
> look for outliers and triage and debug them based on these results. We
> weren't running the test for performance but to make sure the cluster was as
> homogeneous as possible. Have you done this before?
>
> I've also seen people run HPL on single nodes and look for outliers. After
> triaging these, HPL is run on smaller groups of nodes within a single switch,
> looking for outliers and triaging them. This continues up to the entire
> system. The point is not to get a great HPL number to submit to the Top500
> but rather to find potential network issues, particularly bad network links.
>
> Thanks for the good work!
> Jeff
>
> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto < olli-pekka.le...@csc.fi >
> wrote:
>> Hi,
>>
>> I finally got around to writing down my cluster-consistency checklist that
>> I've been planning for a long time:
>> https://github.com/oplehto/cluster-checks/
>>
>> The goal is to make the baseline installation of a cluster as consistent as
>> possible and make vendors work for their money. :) Hopefully publishing this
>> will also help vendors catch some of the issues that slip through the cracks
>> even before clusters are handed over. It's also a good idea to run these
>> types of checks during the lifetime of the system, as there's always some
>> consistency creep as hardware gets replaced.
>>
>> If someone is interested in contributing, pull requests or comments on the
>> list are welcome. I'm sure that there's something missing as well. Right now
>> it's just a text file, but some nicer scripts and postprocessing for the
>> output might happen at some point as well. All the examples are also very
>> HP-oriented at this point.
>>
>> Best regards,
>> Olli-Pekka
>>
>> --
>> Olli-Pekka Lehto
>> Development Manager
>> Computing Platforms
>> CSC - IT Center for Science Ltd.
>> E-Mail: olli-pekka.le...@csc.fi
>> Tel: +358 50 381 8604
>> skype: oplehto // twitter: ople
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
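P.S. For anyone wanting to automate the single-node outlier triage discussed
above, a rough sketch of the idea (hypothetical input format and threshold;
adapt to whatever your benchmark actually prints):

```python
#!/usr/bin/env python3
# Flag per-node benchmark outliers (e.g. single-node HPL or STREAM results).
# Hypothetical input format: one "hostname value" pair per line on stdin,
# where a higher value means better performance.
# Nodes more than THRESHOLD below the median are reported for triage.

import statistics
import sys

THRESHOLD = 0.05  # flag nodes >5% below the median; tune for your benchmark


def find_outliers(results, threshold=THRESHOLD):
    """results: dict of node -> measured performance (higher is better)."""
    median = statistics.median(results.values())
    return {node: value for node, value in results.items()
            if value < median * (1.0 - threshold)}


if __name__ == "__main__":
    results = {}
    for line in sys.stdin:
        node, value = line.split()
        results[node] = float(value)
    median = statistics.median(results.values())
    for node, value in sorted(find_outliers(results).items()):
        print("%s: %.1f (median %.1f)" % (node, value, median))
```

Run the benchmark on every node, collect one result per node, pipe the list
in, and triage whatever it prints. The same check works at each level of the
switch-by-switch HPL sweep described above.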