Re: [Beowulf] Infiniband modular switches

Patrick Geoffray Thu, 26 Jun 2008 21:59:17 -0700

Gilad Shainer wrote:

Not only that I was there, but also had conversations afterwards. It is
a really "fair" comparison when you have different injection
rate/network capacity parameters. You can also take 10Mb and inject it
into 10Gb/s network to show the same, and you always can create the
network pattern to show what you want to show, but you prove nothing

The injection rate is irrelevant in these tests and the network pattern is well defined: *random* pairwise exchange. In both cases (IB and Quadrics in the slides), the fabric is full bisection, i.e. there are enough links in the network to support the aggregate traffic of all ports. The test consists of measuring the MPI bandwidth between random pairs of nodes simultaneously.

Logically, you would expect to reach the full bandwidth between all pairs, because there are enough links in the fabric to support this traffic. If you measure each pair independently, you will always get the link rate, no problem. However, if you measure them simultaneously, you will have contention: a few pairs may still reach full bandwidth but most will only get a fraction of it. You can measure the min, max and average of the bandwidth between these pairs for a large number of different pairs to evaluate the efficiency of the routing.

The link bandwidth (injection rate) is irrelevant because the results are normalized (efficiency). What the slides show is that the efficiency of Quadrics is better (the average bandwidth is higher despite a lower link bandwidth) and the bandwidth distribution is very narrow for Quadrics (spread between min and max pairwise bandwidth). This is a direct result of adaptive routing in Quadrics vs static routing in IB. Woven Systems reported similar results at Sandia using adaptive routing in Ethernet vs static routing in IB.

With static routing, you can find *one* set of routes that will provide full bandwidth between all pairs for a given set of pairs. If you change the set of pairs without changing the set of routes, then you will get much less than full bandwidth. On average, if you measure with enough random sets of pairs, you will get an aggregate efficiency of ~40% with static routing, on several interconnects using full bisection topologies (Clos or Fat Tree), single virtual channel, wormhole switching and static routing. It has nothing to do with link rate, it is due to Head-of-Line (HOL) blocking: http://en.wikipedia.org/wiki/Head-of-line_blocking
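The effect described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the benchmark from the slides: the topology (k leaf switches, k hosts each, k spines), the destination-modulo route selection, and the equal sharing of a contended link are all assumptions made for the example, and every function name is hypothetical. The model does not simulate wormhole switching or HOL blocking propagating through the fabric, so its numbers should not be expected to match the ~40% figure; it only shows that one fixed set of routes is perfect for one set of pairs and loses bandwidth on random ones.

```python
import random
from collections import Counter


def random_pairs(n, rng):
    """Random pairwise exchange: shuffle the hosts, pair them up, both directions."""
    hosts = list(range(n))
    rng.shuffle(hosts)
    partner = {}
    for a, b in zip(hosts[0::2], hosts[1::2]):
        partner[a], partner[b] = b, a
    return partner


def flow_bandwidths(partner, k):
    """Per-flow bandwidth as a fraction of link rate under static routing.

    Assumptions (not from the original post):
      - k leaf switches, k hosts per leaf, k spines (full bisection);
      - the spine is chosen from the destination only (dst % k);
      - flows sharing an uplink split it equally.
    With this route choice each downlink serves a single destination host,
    so in a pairwise exchange only the uplinks can be contended.
    """
    uplink = {}
    for src, dst in partner.items():
        if src // k != dst // k:  # flow leaves its leaf switch
            uplink[src] = (src // k, dst % k)
    load = Counter(uplink.values())
    return [1.0 / load[uplink[s]] if s in uplink else 1.0 for s in partner]


def summarize(bw):
    """Return (min, average, max) of the per-flow bandwidth fractions."""
    return min(bw), sum(bw) / len(bw), max(bw)


k = 16  # hypothetical size: 256 hosts
rng = random.Random(0)

# One set of pairs that happens to match the routes: same port index, neighbouring leaf.
friendly = {l * k + i: (l ^ 1) * k + i for l in range(k) for i in range(k)}
friendly_stats = summarize(flow_bandwidths(friendly, k))

# Many random sets of pairs over the same, unchanged routes.
runs = [summarize(flow_bandwidths(random_pairs(k * k, rng), k)) for _ in range(200)]
avg_eff = sum(r[1] for r in runs) / len(runs)
worst = min(r[0] for r in runs)

print("matched pairs  (min, avg, max):", friendly_stats)
print("random pairs   average efficiency: %.2f, worst pair: %.2f" % (avg_eff, worst))
```

Changing k or the seed changes the numbers, but the matched set of pairs stays at full bandwidth while the random sets do not, which is the point being made about a fixed set of routes.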

here. I am not favor of static routing only or adaptive routing only,
and having both options is the most flexible solution.

It's not as simple as that. If you have a cluster that will run multiple jobs, most likely at the same time, which routing do you use? If you use static routing, efficiency may be good for one job, and bad for another. Worse, the efficiency will change if I run the same job on different nodes, or depending on what other job is running at the same time on the cluster. If you use adaptive routing, efficiency will most likely be higher (maybe not by much) but, more important, it will be more deterministic. Determinism means less load imbalance, predictable time to completion, higher job throughput.
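To make the comparison concrete, here is a rough sketch that runs the same random pairings through two route-selection policies. Everything in it is an assumption for illustration: the small Clos topology, the destination-modulo static routes, and in particular the "greedy" policy, which is an idealized stand-in that picks the least-loaded spine for each flow with global knowledge. It is not how Quadrics or any real adaptive fabric selects routes, and contended links are simply assumed to be shared equally.

```python
import random
from collections import Counter

K = 8  # hypothetical size: K leaves, K hosts per leaf, K spines


def random_pairs(n, rng):
    """Random pairwise exchange: shuffle the hosts, pair them up, both directions."""
    hosts = list(range(n))
    rng.shuffle(hosts)
    partner = {}
    for a, b in zip(hosts[0::2], hosts[1::2]):
        partner[a], partner[b] = b, a
    return partner


def static_bw(partner):
    """Static routes: the spine depends on the destination only (dst % K)."""
    up = {s: (s // K, d % K) for s, d in partner.items() if s // K != d // K}
    load = Counter(up.values())
    return [1.0 / load[up[s]] if s in up else 1.0 for s in partner]


def greedy_bw(partner):
    """Idealized load-aware choice: each flow takes the currently least-loaded spine."""
    up_load, down_load, choice = Counter(), Counter(), {}
    for s, d in partner.items():
        if s // K == d // K:
            continue  # stays inside one leaf switch
        spine = min(range(K),
                    key=lambda x: max(up_load[(s // K, x)], down_load[(x, d // K)]))
        up_load[(s // K, spine)] += 1
        down_load[(spine, d // K)] += 1
        choice[s] = spine
    bw = []
    for s, d in partner.items():
        if s in choice:
            x = choice[s]
            bw.append(1.0 / max(up_load[(s // K, x)], down_load[(x, d // K)]))
        else:
            bw.append(1.0)
    return bw


def mean(xs):
    return sum(xs) / len(xs)


rng = random.Random(1)
static_avgs, greedy_avgs = [], []
for _ in range(100):  # 100 different "job placements"
    pairs = random_pairs(K * K, rng)
    static_avgs.append(mean(static_bw(pairs)))
    greedy_avgs.append(mean(greedy_bw(pairs)))

for name, avgs in (("static", static_avgs), ("greedy", greedy_avgs)):
    print("%s: mean efficiency %.2f, run-to-run spread %.2f"
          % (name, mean(avgs), max(avgs) - min(avgs)))
```

The run-to-run spread printed for each policy is the quantity that corresponds to determinism in the argument above: how much the efficiency of the same job moves when only the node placement changes.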

So far, IB only used static routing. If it still relies on packet order on the wire for a given Queue Pair, then the only way to do some sort of adaptive routing is to use a different QP for each possible route (LID). This is what Panda's group tried in a paper. However, the number of QPs explodes, each QP is still subject to HOL blocking and the QP interleaving is static.

You can see that the worst case static routing goes quickly below 40%, but the average eventually goes there as well.
So what is your proof point here? I am sure you will find many cases
that static routing will do better (definitely on other interconnects)
and cases for adaptive routing.

No, static routing is static routing, on all interconnects. There is no magic here, HOL blocking applies to everybody. My point is that under *random* structured patterns (such as pairwise exchange), static routing sucks. There are no other cases of random, it's just random.

If you want to argue that structured traffic patterns across multiple jobs running simultaneously on the same fabric are not equivalent to random structured traffic, then this will go nowhere.


Patrick
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
