[Beowulf] Performance characterising a HPC application

stephen mulcahy Fri, 16 Mar 2007 06:41:14 -0800

Hi,

I'm looking for any suggestions people might have on performance characterising a HPC application (how's that for a broad query :)


Background:

We have a 20 node Opteron 270 (2.0GHz dual core, 4GB RAM, diskless) cluster with gigabit ethernet interconnect. It is used primarily to run an Oceanography numerical model called ROMS (http://www.myroms.org/ in case anyone is interested). The nodes are running Debian GNU/Linux Etch (AMD64 version) and we're using the Portland Group Fortran 90 compiler and mpich2 for our MPI needs. The cluster has been in production mode pretty much since it was commissioned so I haven't gotten a chance to do much tuning and benchmarking.

I'm currently trying to characterise the performance of the model, in particular to determine where it is:


1. processor bound.

2. memory bound.

3. interconnect bound.

4. headnode bound.

I'm curious about how others go about this kind of characterisation - I'm not at all familiar with the model at a code level (my expertise, if any!, is in the area of Linux and hardware rather than in Fortran 90 code) so I don't have any particular insights from that perspective. I'm hoping I can characterise the app from outside using various measurement tools.

So far, I've used a mix of things including Ganglia, htop, iostat, vmstat, wireshark, ifstat (and a few others) to try and get a picture of how the app behaves when running. One of my problems is having too much data to analyse and not being entirely certain what is significant and what isn't.


So far I've seen the following characteristics,

On the head node:

* Memory usage is pretty constant at about 1GB while the model is running. An additional 2-3GB is used in memory buffers and memory caches, presumably because this node does a lot of I/O.
* Network traffic in averages about 40 Mbit/sec but peaks at about 940 Mbit/sec (I was surprised by this - I didn't think gigabit was capable of even approaching this in practice. Is this figure dubious, or are bursts at this speed possible on good gigabit hardware?). Network traffic out averages about 35 Mbit/sec but peaks at about 200 Mbit/sec. The peaks are very short (maybe a few seconds in duration, presumably at the end of an MPI "run", if that is the correct term).
* Processor usage averages about 25% but if I watch htop activity for a while I see bursts of 80-90% user activity on each core, so the average is misleading.
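
As a rough sanity check on the 940 Mbit/sec peak, the sketch below estimates the best-case TCP payload rate on gigabit ethernet from standard per-frame overheads. The function name is made up for illustration, and it assumes full-sized frames, no retransmits, and that the monitoring tool reports something close to payload bytes (tools that count at the interface level may report slightly higher figures).

```python
# Per-frame bytes on the wire that carry no IP data:
# preamble + start-of-frame (8), Ethernet header (14), FCS (4), inter-frame gap (12)
ETH_WIRE_OVERHEAD = 8 + 14 + 4 + 12


def tcp_goodput_mbit(mtu=1500, link_mbit=1000.0, ip_hdr=20, tcp_hdr=20, tcp_opts=0):
    """Best-case TCP payload rate (Mbit/sec) for back-to-back full-sized frames.

    Hypothetical helper for illustration only; ignores ACK traffic,
    retransmits and host-side limits.
    """
    payload = mtu - ip_hdr - tcp_hdr - tcp_opts   # TCP payload bytes per frame
    wire = mtu + ETH_WIRE_OVERHEAD                # bytes occupied on the wire
    return link_mbit * payload / wire


if __name__ == "__main__":
    print(f"MTU 1500, no TCP options : {tcp_goodput_mbit():.1f} Mbit/sec")
    print(f"MTU 1500, TCP timestamps : {tcp_goodput_mbit(tcp_opts=12):.1f} Mbit/sec")
    print(f"MTU 9000, no TCP options : {tcp_goodput_mbit(mtu=9000):.1f} Mbit/sec")
```

Under these assumptions the ceiling at the standard 1500-byte MTU works out to roughly 940-950 Mbit/sec, so a short burst at about 940 Mbit/sec is at least arithmetically plausible.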


On a compute node:

* Memory usage is pretty constant at about 700MB while the model is running, with very little used in buffers or caches.
* Network traffic in averages about 50 Mbit/sec but peaks at about 200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks at about 200 Mbit/sec. The peaks are very short (maybe a few seconds in duration, presumably at the end of an MPI "run", if that is the correct term).
* Processor usage averages about 20% but if I watch htop activity for a while I see bursts of 50-60% user activity on each core, so the average is misleading.

I'm inclined to install sar on these nodes and run it for a while - although again I'm wary about generating lots of performance data if I'm not sure what I'm looking for. I'm also a little wary of some of the RRD based tools which (for space-saving reasons) seem to do a lot of averaging which may actually hide information about bursts. Given that the model run here seems to be quite bursty I think that peak information is important.

I'm still unsure what the bottleneck currently is. My hunch is that a faster interconnect *should* give a better performance but I'm not sure how to quantify that. Do others here running MPI jobs see big improvements in using Infiniband over Gigabit for MPI jobs or does it really depend on the characteristics of the MPI job? What characteristics should I be looking for?
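
To illustrate the averaging concern, here is a minimal sketch that consolidates one-second samples into coarser windows using either the mean or the maximum. The sample values are invented to resemble the head node traffic described above (about 40 Mbit/sec with a three-second burst at 940 Mbit/sec), and `consolidate` is a hypothetical helper, not part of any monitoring tool.

```python
def consolidate(samples, window, how="average"):
    """Collapse per-second samples into one value per window.

    how="average" mimics mean-based consolidation; how="max" keeps the peak.
    Hypothetical helper for illustration only.
    """
    out = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        if how == "max":
            out.append(max(chunk))
        else:
            out.append(sum(chunk) / len(chunk))
    return out


if __name__ == "__main__":
    # Invented data: 57 seconds at 40 Mbit/sec, then a 3 second burst at 940
    one_minute = [40.0] * 57 + [940.0] * 3
    print("per-minute average:", consolidate(one_minute, 60, "average"))  # [85.0]
    print("per-minute maximum:", consolidate(one_minute, 60, "max"))      # [940.0]
```

With this made-up data the one-minute average is 85 Mbit/sec while the one-minute maximum is 940 Mbit/sec, so the burst is invisible in the averaged series. RRDtool itself supports a MAX consolidation function alongside AVERAGE, so keeping peaks may be a matter of how the archives are defined in whichever front end is used.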


The goals of this characterisation exercise are two-fold,

a) to identify what parts of the system any tuning exercises should focus on.

- Some possible low-hanging fruit includes enabling jumbo frames [some rough calculations suggest that we have 2 sizes of MPI messages, one at 40k and one at 205k ... use of jumbo frames should significantly reduce the number of packets needed to transmit a message, but would the gains be significant?].
- Do people here normally tune the TCP/IP stack? My experience is that it is very easy to reduce the performance by trying to tweak kernel buffer sizes due to the trade-offs in memory ... and 2.6 Linux kernels should be reasonably smart about this.
- Have people had much success with bonding and gigabit, or are there significant overheads in bonding?


b) to allow us to specify a new cluster which will run the model *faster*!

- From a perusal of past postings it sounds like current Opterons lag current Xeons in raw numeric performance (but only by a little) but that the memory controller architecture of Opterons gives them an overall performance edge in most typical HPC loads. Is that a correct 36,000ft summary or does it still depend very much on the application?

I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters available through their developer program for testing. Has anyone actually used these? It sounds like what we really need before spec'ing a new system is to list our assumptions and then go and test them on some similar hardware - these clusters would seem to offer an ideal environment for doing that but I'm wondering, in practice, how many hoops one has to jump through to avail of them ... and whether parties from outside of the US are even allowed access to these.

Apologies for the long-winded email but all feedback welcome. I'll be happy to summarise any off-list comments back to the list,
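
For the jumbo frames question under (a), the packet counts can be sketched as below. This assumes "40k" and "205k" mean 40,000 and 205,000 bytes, that messages travel over TCP with 40 bytes of IP and TCP headers per packet and no TCP options, and it ignores any MPI-level headers; `packets_needed` is a hypothetical helper.

```python
import math

IP_TCP_HEADERS = 40  # 20 bytes IP + 20 bytes TCP, assuming no options


def packets_needed(message_bytes, mtu):
    """Number of full or partial TCP segments needed to carry one message."""
    mss = mtu - IP_TCP_HEADERS  # payload bytes per packet
    return math.ceil(message_bytes / mss)


if __name__ == "__main__":
    # Message sizes are assumptions: "40k" and "205k" taken as decimal bytes
    for size in (40_000, 205_000):
        std = packets_needed(size, 1500)
        jumbo = packets_needed(size, 9000)
        print(f"{size} bytes: {std} packets at MTU 1500, {jumbo} at MTU 9000")
```

Under those assumptions the 40k message drops from 28 packets to 5 and the 205k message from 141 to 23, roughly a six-fold reduction in packet count. That mainly cuts per-packet overhead (interrupts, header processing); whether it translates into a measurable speedup would still need testing, and the switch and every NIC on the path have to support the larger MTU.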


-stephen
--
Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
   GMIT, Dublin Rd, Galway, Ireland.      http://www.aplpi.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
