Hi,

I'm looking for any suggestions people might have on characterising the performance of an HPC application (how's that for a broad query :)

Background:
We have a 20 node Opteron 270 (2.0GHz dual core, 4GB RAM, diskless) cluster with a gigabit ethernet interconnect. It is used primarily to run an oceanography numerical model called ROMS (http://www.myroms.org/ in case anyone is interested). The nodes are running Debian GNU/Linux Etch (AMD64 version) and we're using the Portland Group Fortran 90 compiler and MPICH2 for our MPI needs. The cluster has been in production mode pretty much since it was commissioned so I haven't had a chance to do much tuning or benchmarking.

I'm currently trying to characterise the performance of the model, in particular to determine where it is

1. processor bound.

2. memory bound.

3. interconnect bound.

4. headnode bound.

I'm curious about how others go about this kind of characterisation. I'm not at all familiar with the model at a code level (my expertise, if any!, is in the area of Linux and hardware rather than in Fortran 90 code) so I don't have any particular insights from that perspective. I'm hoping I can characterise the app from outside using various measurement tools.

So far, I've used a mix of things including Ganglia, htop, iostat, vmstat, wireshark, ifstat (and a few others) to try and get a picture of how the app behaves when running. One of my problems is having too much data to analyse and not being entirely certain what is significant and what isn't.

So far I've seen the following characteristics,

On the head node:
* Memory usage is pretty constant at about 1GB while the model is running. An additional 2-3GB is used in memory buffers and memory caches, presumably because this node does a lot of I/O.
* Network traffic in averages at about 40 Mbit/sec but peaks to about 940 Mbit/sec (I was surprised by this - I didn't think gigabit was capable of even approaching this in practice, is this figure dubious or are bursts at this speed possible on good Gigabit hardware?). Network traffic out averages about 35 Mbit/sec but peaks to about 200 Mbit/sec. The peaks are very short (maybe a few seconds in duration, presumably at the end of an MPI "run" if that is the correct term).
* Processor usage averages about 25% but if I watch htop activity for a while I see bursts of 80-90% user activity on each core so the average is misleading.
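
As a rough sanity check on the 940 Mbit/sec peak, the sketch below estimates the most TCP payload a gigabit link can carry once per-frame overhead is subtracted. It assumes IPv4, TCP with the timestamp option enabled and no VLAN tags; none of that has been verified on this cluster, and monitoring tools may count header bytes differently.

```python
# Rough upper bound on TCP payload throughput over gigabit ethernet.
# Assumptions (not measured on this cluster): IPv4, TCP with the
# timestamp option enabled (12 bytes), no VLAN tag.

LINK_BPS = 1_000_000_000  # raw gigabit line rate, bits/sec

# Per-frame overhead on the wire, in bytes:
# preamble + SFD (8) + ethernet header (14) + FCS (4) + inter-frame gap (12)
WIRE_OVERHEAD = 8 + 14 + 4 + 12

# Per-packet protocol headers carried inside the MTU, in bytes:
# IPv4 (20) + TCP (20) + TCP timestamp option (12)
HEADER_OVERHEAD = 20 + 20 + 12


def max_goodput_mbit(mtu):
    """Return the best-case TCP payload rate in Mbit/sec for a given MTU."""
    payload = mtu - HEADER_OVERHEAD
    on_wire = mtu + WIRE_OVERHEAD
    return LINK_BPS * payload / on_wire / 1e6


for mtu in (1500, 9000):
    print(f"MTU {mtu}: ~{max_goodput_mbit(mtu):.0f} Mbit/sec of TCP payload")
```

Under those assumptions the ceiling at the standard 1500 byte MTU works out to roughly 941 Mbit/sec, so a short burst at about 940 Mbit/sec looks plausible rather than dubious.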

On a compute node:
* Memory usage is pretty constant at about 700MB while the model is running with very little used in buffers or caches.
* Network traffic in averages at about 50 Mbit/sec but peaks to about 200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks to about 200 Mbit/sec. The peaks are very short (maybe a few seconds in duration, presumably at the end of an MPI "run" if that is the correct term).
* Processor usage averages about 20% but if I watch htop activity for a while I see bursts of 50-60% user activity on each core so the average is misleading.

I'm inclined to install sar on these nodes and run it for a while - although again I'm wary about generating lots of performance data if I'm not sure what I'm looking for. I'm also a little wary of some of the RRD based tools which (for space-saving reasons) seem to do a lot of averaging which may actually hide information about bursts. Given that the model run here seems to be quite bursty I think that peak information is important.
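
To illustrate the averaging concern, here is a minimal sketch using made-up per-second samples (not measured data) that are consolidated into one 5-minute bucket, once by mean and once by maximum. The `consolidate` helper is hypothetical and only mimics the general idea of RRD-style consolidation.

```python
# Synthetic per-second throughput samples (Mbit/sec): mostly quiet with
# one short burst. These numbers are illustrative, not measured data.
samples = [40] * 300
samples[100:103] = [940, 940, 940]  # a 3-second burst


def consolidate(values, step, func):
    """Collapse consecutive windows of `step` samples using `func`."""
    return [func(values[i:i + step]) for i in range(0, len(values), step)]


def mean(window):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(window) / len(window)


averaged = consolidate(samples, 300, mean)  # what an AVERAGE archive keeps
peaks = consolidate(samples, 300, max)      # what a MAX archive keeps

print("5-minute average:", averaged)
print("5-minute peak:   ", peaks)
```

Here the 3-second burst at 940 Mbit/sec barely moves the 5-minute average (49 Mbit/sec), while the maximum preserves it. RRDtool does offer a MAX consolidation function alongside AVERAGE, so whether bursts survive depends on how a given tool's archives are configured.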

I'm still unsure what the bottleneck currently is. My hunch is that a faster interconnect *should* give better performance but I'm not sure how to quantify that. Do others here running MPI jobs see big improvements from using Infiniband over Gigabit, or does it really depend on the characteristics of the MPI job? What characteristics should I be looking for?

The goals of this characterisation exercise are two-fold,

a) to identify what parts of the system any tuning exercises should focus on.
- some possible low hanging fruit includes enabling jumbo frames [some rough calculations suggest that we have 2 sizes of MPI messages, one at 40k and one at 205k ... use of jumbo frames should significantly reduce the number of packets to transmit a message, but would the gains be significant?].
- Do people here normally tune the TCP/IP stack? My experience is that it is very easy to reduce the performance by trying to tweak kernel buffer sizes due to the trade-offs in memory ... and 2.6 Linux kernels should be reasonably smart about this.
- Have people had much success with bonding and gigabit or are there significant overheads in bonding?

b) to allow us to specify a new cluster which will run the model *faster*!
- from a perusal of past postings it sounds like current Opterons lag current Xeons in raw numeric performance (but only by a little) but that the memory controller architecture of Opterons gives them an overall performance edge in most typical HPC loads. Is that a correct 36,000ft summary or does it still depend very much on the application?
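
On the jumbo frames question under (a), the rough packet-count calculation can be sketched as follows. It assumes "40k" and "205k" mean 40 * 1024 and 205 * 1024 bytes, 52 bytes of IPv4/TCP/timestamp headers per packet, full-sized segments, and it ignores any MPI envelope bytes; all of these are assumptions rather than measurements.

```python
import math

# Assumptions: IPv4 (20) + TCP (20) + TCP timestamp option (12) header
# bytes per packet, full-sized segments, MPI envelope bytes ignored.
HEADER_BYTES = 20 + 20 + 12

# "40k" and "205k" are assumed to mean multiples of 1024 bytes.
MESSAGE_SIZES = {"40k": 40 * 1024, "205k": 205 * 1024}


def packets_needed(message_bytes, mtu):
    """Number of packets needed to carry one message at the given MTU."""
    payload = mtu - HEADER_BYTES
    return math.ceil(message_bytes / payload)


for label, size in MESSAGE_SIZES.items():
    std = packets_needed(size, 1500)
    jumbo = packets_needed(size, 9000)
    print(f"{label}: {std} packets at MTU 1500, {jumbo} at MTU 9000")
```

With these assumptions the 40k message drops from 29 packets to 5 and the 205k message from 145 to 24. That is roughly a sixfold cut in per-packet work (interrupts, header processing), but only a few percent fewer bytes on the wire, so any gain probably depends on whether the nodes are limited by per-packet overhead rather than by raw bandwidth. Jumbo frames also need every NIC and switch on the path to support the larger MTU.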

I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters available through their developer program for testing. Has anyone actually used these? It sounds like what we really need before spec'ing a new system is to list our assumptions and then go and test them on some similar hardware - these clusters would seem to offer an ideal environment for doing that but I'm wondering, in practice, how many hoops one has to jump through to avail of them ... and whether parties from outside of the US are even allowed access to these.

Apologies for the long-winded email but all feedback welcome. I'll be happy to summarise any off-list comments back to the list,

-stephen
--
Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
   GMIT, Dublin Rd, Galway, Ireland.      http://www.aplpi.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
