> Thanks for the kind words and comments! Good catch with HPL. It's
> definitely part of the test regime. I typically run 3 tests for
> consistency:
>
> - Separate instance of STREAM2 on each node
> - Separate instance of HPL on each node
> - Simple MPI latency / bandwidth test called mpisweep that tests every
>   link (I'll put this up on github later as well)
>
> I now made the changes to the document.
>
> After this set of tests I'm not completely sure if NPB will add any
> further information. Those 3 benchmarks combined with the other checks
> should pretty much expose all the possible issues. However, I could be
> missing something again :)

NAS will verify the results. On several occasions I have found that NAS
gave good numbers but the results did not verify. That let me dig into
lower-level issues until I found the problem (in one case a cable, IIRC).

BTW, I run NAS all the time to test performance and make sure things
are running properly on my deskside clusters. I have done it so often
that I can tell which test is running by watching wwtop (the Warewulf
cluster-based top, which shows load, network, and memory but no
application names).

--
Doug

>
> Best regards,
> O-P
> --
> Olli-Pekka Lehto
> Development Manager
> Computing Platforms
> CSC - IT Center for Science Ltd.
> E-Mail: olli-pekka.le...@csc.fi
> Tel: +358 50 381 8604
> skype: oplehto // twitter: ople
>
>> From: "Jeffrey Layton" <layto...@gmail.com>
>> To: "Olli-Pekka Lehto" <olli-pekka.le...@csc.fi>
>> Cc: beowulf@beowulf.org
>> Sent: Tuesday, 22 March, 2016 16:45:20
>> Subject: Re: [Beowulf] Cluster consistency checks
>
>> Olli-Pekka,
>
>> Very nice - I'm glad you put a list down. Many of the things that I do
>> are based on experience.
>
>> A long time ago, in one of my previous jobs, we used to run NAS Parallel
>> Benchmark (NPB) on single nodes to get a baseline of performance. We
>> would look for outliers and triage and debug them based on these
>> results. We weren't running the tests for performance but to make sure
>> the cluster was as homogeneous as possible. Have you done this before?
>
>> I've also seen people run HPL on single nodes and look for outliers.
>> After triaging these, they run HPL on smaller groups of nodes within a
>> single switch, again looking for outliers and triaging them. This
>> continues up to the entire system. The point is not to get a great HPL
>> number to submit to the Top500 but rather to find potential network
>> issues, particularly with network links.
>
>> Thanks for the good work!
>
>> Jeff
>
>> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto
>> <olli-pekka.le...@csc.fi> wrote:
>
>>> Hi,
>
>>> I finally got around to writing down my cluster-consistency checklist
>>> that I've been planning for a long time:
>
>>> https://github.com/oplehto/cluster-checks/
>
>>> The goal is to try to make the baseline installation of a cluster as
>>> consistent as possible and make vendors work for their money. :) Of
>>> course hopefully publishing this will help vendors capture some of the
>>> issues that slip through the cracks even before clusters are handed
>>> over. It's also a good idea to run these types of checks during the
>>> lifetime of the system as there's always some consistency creep as
>>> hardware gets replaced.
>
>>> If someone is interested in contributing, pull requests or comments on
>>> the list are welcome. I'm sure that there's something missing as well.
>>> Right now it's just a text-file but making some nicer scripts and
>>> postprocessing for the output might happen as well at some point. All
>>> the examples are very HP oriented as well at this point.
>
>>> Best regards,
>>> Olli-Pekka
>>> --
>>> Olli-Pekka Lehto
>>> Development Manager
>>> Computing Platforms
>>> CSC - IT Center for Science Ltd.
>>> E-Mail: olli-pekka.le...@csc.fi
>>> Tel: +358 50 381 8604
>>> skype: oplehto // twitter: ople
>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
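
For anyone who wants to script the single-node triage discussed in this thread, here is a rough Python sketch. It is not taken from the cluster-checks repository or from anyone's actual tooling. It assumes each node's NPB run was saved as text and that the summary contains the "Verification" and "Mop/s total" lines as I remember NPB printing them; check the labels against your own output. The function names, the 5% tolerance, and the sample figures are all made up for illustration.

```python
import re
import statistics

# Assumed NPB summary labels; confirm against the output of your build.
VERIFY_RE = re.compile(r"Verification\s*=\s*(\w+)")
MOPS_RE = re.compile(r"Mop/s total\s*=\s*([0-9.]+)")


def parse_npb_output(text):
    """Return (verified, mops) for the stdout of one NPB run."""
    verify = VERIFY_RE.search(text)
    mops = MOPS_RE.search(text)
    verified = bool(verify) and verify.group(1) == "SUCCESSFUL"
    rate = float(mops.group(1)) if mops else None
    return verified, rate


def triage(results, tolerance=0.05):
    """Map each suspect node to a reason.

    results: dict of node name -> (verified, mops).
    A node is suspect if its run did not verify, or if its rate is more
    than `tolerance` (as a fraction) away from the median of all nodes.
    """
    rates = [rate for _, rate in results.values() if rate is not None]
    median = statistics.median(rates) if rates else None
    suspects = {}
    for node, (verified, rate) in sorted(results.items()):
        if not verified:
            # Good numbers do not matter if the answer is wrong.
            suspects[node] = "verification failed"
        elif rate is None:
            suspects[node] = "no Mop/s figure found"
        elif median and abs(rate - median) / median > tolerance:
            suspects[node] = f"{rate:.1f} Mop/s vs median {median:.1f}"
    return suspects


def fake_output(mops, status):
    """Build a made-up two-line NPB summary for the demo below."""
    return (f" Mop/s total     =   {mops:.2f}\n"
            f" Verification    =   {status}\n")


# Invented sample data: n003 is fast but wrong, n004 is correct but slow.
SAMPLE = {
    "n001": fake_output(1520.10, "SUCCESSFUL"),
    "n002": fake_output(1498.70, "SUCCESSFUL"),
    "n003": fake_output(1515.30, "UNSUCCESSFUL"),
    "n004": fake_output(1210.40, "SUCCESSFUL"),
}

if __name__ == "__main__":
    parsed = {node: parse_npb_output(out) for node, out in SAMPLE.items()}
    for node, reason in triage(parsed).items():
        print(f"{node}: {reason}")
```

The verification check comes first on purpose: a node that posts a good rate but fails verification is exactly the case described above, and a pure outlier test on the numbers would miss it. The same median comparison could be pointed at per-node STREAM or HPL figures.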