>>> On 23/03/16 14:55, Douglas Eadline wrote:
>>> Thanks for the kind words and comments! Good catch with HPL. It's
>>> definitely part of the test regime. I typically run 3 tests for
>>> consistency:
>>>
>>> - Separate instance of STREAM2 on each node
>>> - Separate instance of HPL on each node
>>> - Simple MPI latency / bandwidth test called mpisweep that tests every
>>>   link (I'll put this up on GitHub later as well)
>
> Any reference to mpisweep yet?
>
> Google didn't give me much...
>
>>> I have now made the changes to the document.
>>>
>>> After this set of tests I'm not completely sure whether NPB will add any
>>> further information. Those 3 benchmarks combined with the other checks
>>> should pretty much expose all the possible issues. However, I could be
>>> missing something again :)
>>
>> NAS will verify the results. On several occasions I have
>> found that NAS gave good numbers but the results did not verify.
>> This allowed me to look at lower-level issues until I found
>> the problem (in one case a cable, IIRC).
>>
>> BTW, I run NAS all the time to test performance and make sure
>> things are running properly on my deskside clusters. I have done
>> it so often I can tell which test is running by watching wwtop
>> (a Warewulf-based cluster top that shows loads, net, and memory but no
>> application names).
>
> Isn't it time someone put together all of these nice tests in a GitHub
> repo, or at least some scripts/framework around each of them to
> build/install/run/verify them with as minimal effort as possible?
>
> I already know your answer: "why don't you?".
> Well, I may, some day, but who wants to help out? Any brave souls?
>
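Since mpisweep has not been published yet, here is only a rough, hypothetical
sketch of the kind of all-pairs latency/bandwidth sweep being described,
assuming mpi4py and numpy with one rank per node. The script name, message
sizes, and repetition counts are made up for illustration and are not taken
from the actual tool:

# link_sweep.py -- a rough, hypothetical sketch of an all-pairs MPI
# latency/bandwidth sweep in the spirit of the "mpisweep" tool mentioned
# above. It is NOT that tool; message sizes and repetition counts are
# arbitrary. Assumes mpi4py and numpy, one rank per node.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

LAT_REPS = 1000          # ping-pong round trips for the latency estimate
BW_BYTES = 4 * 1024**2   # message size for the bandwidth estimate (4 MiB)
small = np.zeros(1, dtype=np.byte)
big = np.zeros(BW_BYTES, dtype=np.byte)

def pingpong(peer, buf, reps):
    # Time 'reps' round trips of 'buf' with 'peer'; return seconds per one-way hop.
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank < peer:
            comm.Send(buf, dest=peer, tag=0)
            comm.Recv(buf, source=peer, tag=0)
        else:
            comm.Recv(buf, source=peer, tag=0)
            comm.Send(buf, dest=peer, tag=0)
    return (MPI.Wtime() - t0) / (2 * reps)

# Walk every node pair in turn; ranks not in the active pair just wait.
for i in range(size):
    for j in range(i + 1, size):
        if rank in (i, j):
            peer = j if rank == i else i
            lat = pingpong(peer, small, LAT_REPS)
            bw = BW_BYTES / pingpong(peer, big, 10) / 1e6   # MB/s, one direction
            if rank == i:
                print(f"link {i:3d} <-> {j:3d}: {lat * 1e6:8.2f} us  {bw:10.1f} MB/s")
        comm.Barrier()   # keep all ranks in step between pairs

Launch it with your MPI's one-rank-per-node placement option (e.g.
mpirun -np <nodes> --map-by ppr:1:node python3 link_sweep.py with Open MPI).
Any pair that stands well apart from the rest points at a suspect cable,
port, or switch path.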
A long time ago on a cluster far, far away...

http://www.clustermonkey.net/Benchmarking-Methods/a-tool-for-cluster-performance-tuning-and-optimization.html

I have not given this any attention in recent years. I wanted to add more
tests and clean up the code. I do use my NAS script (that also needs
cleaning up) to run multiple MPI/compiler/nodes/size combinations all the
time. It contains a lot of historical cruft.

--
Doug

> K.
>>
>> --
>> Doug
>>
>>> Best regards,
>>> O-P
>>> --
>>> Olli-Pekka Lehto
>>> Development Manager
>>> Computing Platforms
>>> CSC - IT Center for Science Ltd.
>>> E-Mail: olli-pekka.le...@csc.fi
>>> Tel: +358 50 381 8604
>>> skype: oplehto // twitter: ople
>>>
>>>> From: "Jeffrey Layton" <layto...@gmail.com>
>>>> To: "Olli-Pekka Lehto" <olli-pekka.le...@csc.fi>
>>>> Cc: beowulf@beowulf.org
>>>> Sent: Tuesday, 22 March, 2016 16:45:20
>>>> Subject: Re: [Beowulf] Cluster consistency checks
>>>>
>>>> Olli-Pekka,
>>>>
>>>> Very nice - I'm glad you put a list down. Many of the things that I do
>>>> are based on experience.
>>>>
>>>> A long time ago, in one of my previous jobs, we used to run the NAS
>>>> Parallel Benchmark (NPB) on single nodes to get a baseline of
>>>> performance. We would look for outliers and triage and debug them
>>>> based on these results. We weren't running the test for performance
>>>> but to make sure the cluster was as homogeneous as possible. Have you
>>>> done this before?
>>>>
>>>> I've also seen people run HPL on single nodes and look for outliers.
>>>> After triaging these, HPL is run on smaller groups of nodes within a
>>>> single switch, looking for outliers and triaging them. This continues
>>>> up to the entire system. The point is not to get a great HPL number to
>>>> submit to the Top500 but rather to find potential network issues,
>>>> particularly network links.
>>>>
>>>> Thanks for the good work!
>>>>
>>>> Jeff
>>>>
>>>> On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto <
>>>> olli-pekka.le...@csc.fi > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I finally got around to writing down my cluster-consistency checklist
>>>>> that I've been planning for a long time:
>>>>>
>>>>> https://github.com/oplehto/cluster-checks/
>>>>>
>>>>> The goal is to try to make the baseline installation of a cluster as
>>>>> consistent as possible and make vendors work for their money. :) Of
>>>>> course, hopefully publishing this will help vendors capture some of
>>>>> the issues that slip through the cracks even before clusters are
>>>>> handed over. It's also a good idea to run these types of checks
>>>>> during the lifetime of the system, as there's always some consistency
>>>>> creep as hardware gets replaced.
>>>>>
>>>>> If someone is interested in contributing, pull requests or comments
>>>>> on the list are welcome. I'm sure that there's something missing as
>>>>> well. Right now it's just a text file, but making some nicer scripts
>>>>> and postprocessing for the output might happen as well at some point.
>>>>> All the examples are very HP-oriented at this point.
>>>>>
>>>>> Best regards,
>>>>> Olli-Pekka
>>>>> --
>>>>> Olli-Pekka Lehto
>>>>> Development Manager
>>>>> Computing Platforms
>>>>> CSC - IT Center for Science Ltd.
>>>>> E-Mail: olli-pekka.le...@csc.fi
>>>>> Tel: +358 50 381 8604
>>>>> skype: oplehto // twitter: ople
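Jeff's per-node outlier triage is easy to script once each node's result is
written out as a "nodename value" line. A minimal sketch of that idea is
below; it is not part of the cluster-checks repo, and the input format and
the 5% tolerance are assumptions:

# flag_outliers.py -- a minimal, hypothetical sketch of per-node outlier
# triage. Not from the cluster-checks repo; the input format
# ("nodename value" per line) and the 5% tolerance are assumptions.
import sys
from statistics import median

TOLERANCE = 0.05   # flag nodes more than 5% away from the cluster median

results = {}
with open(sys.argv[1]) as f:          # e.g. "node001 41.7" (GFLOPS, GB/s, ...)
    for line in f:
        node, value = line.split()
        results[node] = float(value)

med = median(results.values())
print(f"median = {med:.2f} over {len(results)} nodes")
for node, value in sorted(results.items()):
    dev = (value - med) / med
    if abs(dev) > TOLERANCE:
        print(f"OUTLIER {node}: {value:.2f} ({dev:+.1%} vs median)")

Run the same single-node HPL/STREAM/NPB job everywhere, dump one line per
node, then e.g. python3 flag_outliers.py hpl_per_node.txt; the same script
works at the per-switch and whole-system levels as the groups grow.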
--
Doug

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf