On 23/03/16 14:55, Douglas Eadline wrote:
Thanks for the kind words and comments! Good catch with HPL. It's
definitely part of the test regime. I typically run 3 tests for
consistency:

- Separate instance of STREAM2 on each node
- Separate instance of HPL on each node
- Simple MPI latency / bandwidth test called mpisweep that tests every
link (I'll put this up on github later as well)
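A minimal sketch of how an "every link" sweep could be scheduled, assuming each test is a point-to-point run between two nodes. This is not how mpisweep itself works (it has not been published); `pair_rounds` and the node names are hypothetical. It uses the standard round-robin "circle method", so every node pair is tested exactly once and no node is in two tests in the same round.

```python
from itertools import combinations


def pair_rounds(nodes):
    """Return a list of rounds; each round is a list of disjoint node pairs.

    Across all rounds every unordered pair of nodes appears exactly once.
    """
    players = list(nodes)
    if len(players) % 2:
        players.append(None)  # placeholder: the node paired with it sits out
    n = len(players)
    rounds = []
    for _ in range(n - 1):
        pairs = []
        for i in range(n // 2):
            a, b = players[i], players[n - 1 - i]
            if a is not None and b is not None:
                pairs.append((a, b))
        rounds.append(pairs)
        # Keep the first entry fixed and rotate the rest by one position.
        players = [players[0], players[-1]] + players[1:-1]
    return rounds


if __name__ == "__main__":
    for number, pairs in enumerate(pair_rounds(["n1", "n2", "n3", "n4"]), 1):
        print(number, pairs)
```

Each round can then be handed to whatever latency/bandwidth test is available, one run per pair, with all pairs in a round running concurrently.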

Any reference to mpisweep yet?

Google didn't give me much...


I now made the changes to the document.

After this set of tests I'm not completely sure if NPB will add any
further information. Those 3 benchmarks combined with the other checks
should pretty much expose all the possible issues. However, I could be
missing something again :)
NAS will verify the results. On several occasions I have
found NAS gave good numbers but the results did not verify.
This allowed me to look at lower level issues until I found
the problem (in one case a cable, IIRC).
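A minimal sketch of checking both things at once: the performance figure and whether the run verified. It assumes the benchmark summary contains lines of the form `Verification = SUCCESSFUL` and `Mop/s total = <number>`, which is how NPB summaries are usually laid out; the sample text below is made up for illustration and is not real output, and `check_npb_output` is a hypothetical helper.

```python
import re

# Assumed summary line formats; adjust the patterns if your NPB build differs.
VERIFY_RE = re.compile(r"^\s*Verification\s*=\s*(\S+)", re.MULTILINE)
MOPS_RE = re.compile(r"^\s*Mop/s total\s*=\s*([0-9.]+)", re.MULTILINE)


def check_npb_output(text):
    """Extract the verification status and total Mop/s from a run's output."""
    verify = VERIFY_RE.search(text)
    mops = MOPS_RE.search(text)
    return {
        # Compare the whole word: "UNSUCCESSFUL" must not count as a pass.
        "verified": verify is not None and verify.group(1) == "SUCCESSFUL",
        "mops_total": float(mops.group(1)) if mops else None,
    }


if __name__ == "__main__":
    # Illustrative text only, not captured from a real run.
    sample = " Mop/s total     =   1234.56\n Verification    =   UNSUCCESSFUL\n"
    print(check_npb_output(sample))
```

A run with a healthy Mop/s figure but `verified` set to False is exactly the case worth chasing down at a lower level.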

BTW, I run NAS all the time to test performance and make sure
things are running properly on my deskside clusters. I have done
it so often I can tell which test is running by watching wwtop
(Warewulf cluster based top that shows loads, net, memory but no
application names).

Isn't it time someone put together all of these nice tests in a GitHub repo, or at least some scripts/framework around each of these to build/install/run/verify them with as little effort as possible?

I already know your answer: "why don't you?".
Well, I may, some day, but who wants to help out? Any brave souls?


K.

--
Doug

Best regards,
O-P
--
Olli-Pekka Lehto
Development Manager
Computing Platforms
CSC - IT Center for Science Ltd.
E-Mail: olli-pekka.le...@csc.fi
Tel: +358 50 381 8604
skype: oplehto // twitter: ople

From: "Jeffrey Layton" <layto...@gmail.com>
To: "Olli-Pekka Lehto" <olli-pekka.le...@csc.fi>
Cc: beowulf@beowulf.org
Sent: Tuesday, 22 March, 2016 16:45:20
Subject: Re: [Beowulf] Cluster consistency checks
Olli-Pekka,
Very nice - I'm glad you put a list down. Many of the things that I do
are based on experience.
A long time ago, in one of my previous jobs, we used to run the NAS
Parallel Benchmarks (NPB) on single nodes to get a baseline of
performance. We would look for outliers and triage and debug them based
on these results. We weren't running the tests for performance but to
make sure the cluster was as homogeneous as possible. Have you done this
before?
I've also seen people run HPL on single nodes and look for outliers.
After triaging these, they run HPL on smaller groups of nodes within a
single switch, look for outliers, and triage them. This continues up to
the entire system. The point is not to get a great HPL number to submit
to the Top500 but rather to find potential network issues, particularly
with network links.
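A minimal sketch of how the staged runs could be laid out, assuming a known mapping from switch to attached nodes. The doubling of group sizes within a switch is one possible choice, not something prescribed above; `hpl_stages`, the switch names, and the node names are hypothetical.

```python
def chunks(nodes, size):
    """Split a node list into consecutive groups of at most `size` nodes."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def hpl_stages(switches):
    """Yield (stage name, list of node groups), from single nodes to all nodes.

    switches maps switch name -> list of nodes on that switch (non-empty dict).
    Groups never span switches until the final full-system stage.
    """
    largest = max(len(nodes) for nodes in switches.values())
    size = 1
    while size < largest:
        groups = [g for nodes in switches.values()
                  for g in chunks(list(nodes), size)]
        yield f"groups of {size}", groups
        size *= 2
    yield "per switch", [list(nodes) for nodes in switches.values()]
    yield "full system", [[n for nodes in switches.values() for n in nodes]]


if __name__ == "__main__":
    layout = {"sw1": ["a1", "a2", "a3", "a4"], "sw2": ["b1", "b2", "b3", "b4"]}
    for name, groups in hpl_stages(layout):
        print(name, groups)
```

Each group in a stage gets its own HPL run; outliers are triaged before moving on to the next stage.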
Thanks for the good work!
Jeff
On Tue, Mar 22, 2016 at 11:32 AM, Olli-Pekka Lehto <olli-pekka.le...@csc.fi> wrote:
Hi,
I finally got around to writing down my cluster-consistency checklist
that I've been planning for a long time:
https://github.com/oplehto/cluster-checks/
The goal is to try to make the baseline installation of a cluster as
consistent as possible and make vendors work for their money. :) Of
course hopefully publishing this will help vendors capture some of the
issues that slip through the cracks even before clusters are handed
over. It's also a good idea to run these types of checks during the
lifetime of the system as there's always some consistency creep as
hardware gets replaced.
If someone is interested in contributing, pull requests or comments on
the list are welcome. I'm sure that there's something missing as well.
Right now it's just a text-file but making some nicer scripts and
postprocessing for the output might happen as well at some point. All
the examples are very HP oriented as well at this point.
Best regards,
Olli-Pekka
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin
Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf


--
Doug

