Gilad Shainer wrote:
Not only that I was there, but also had conversations afterwards. It is
a really "fair" comparison when you have different injection
rate/network capacity parameters. You can also take 10Mb and inject it
into 10Gb/s network to show the same, and you always can create the
network pattern to show what you want to show, but you prove nothing

The injection rate is irrelevant in these tests, and the network pattern is well defined: *random* pairwise exchange. In both cases (IB and Quadrics in the slides), the fabric is full bisection, i.e. there are enough links in the network to support the aggregate traffic of all ports. The test consists of measuring the MPI bandwidth between random pairs of nodes simultaneously.

Logically, you would expect to reach the full bandwidth between all pairs, because there are enough links in the fabric to support this traffic. If you measure each pair independently, you will always get the link rate, no problem. However, if you measure them simultaneously, you will have contention: a few pairs may still reach full bandwidth but most will only get a fraction of it. You can measure the min, max and average of the bandwidth between these pairs for a large number of different pairs to evaluate the efficiency of the routing.

The link bandwidth (injection rate) is irrelevant because the results are normalized (efficiency). What the slides show is that the efficiency of Quadrics is better (the average bandwidth is higher despite a lower link bandwidth) and the bandwidth distribution is very narrow for Quadrics (spread between min and max pairwise bandwidth). This is a direct result of adaptive routing in Quadrics vs static routing in IB. Woven Systems reported similar results at Sandia using adaptive routing in Ethernet vs static routing in IB.
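
A minimal sketch of the normalization described above. The function name `efficiency_stats` and the sample figures are hypothetical (they are not taken from the slides); the point is only that dividing each pair's measured bandwidth by the link rate gives min/avg/max efficiencies that can be compared across fabrics with different link rates:

```python
def efficiency_stats(pair_bandwidths, link_rate):
    """Normalize per-pair bandwidths by the link rate.

    pair_bandwidths: bandwidths measured simultaneously, one per pair
    link_rate: peak bandwidth of a single link, in the same unit
    Returns the min, average and max efficiency (1.0 = full link rate).
    """
    if not pair_bandwidths:
        raise ValueError("need at least one measured pair")
    if link_rate <= 0:
        raise ValueError("link_rate must be positive")
    eff = [bw / link_rate for bw in pair_bandwidths]
    return {"min": min(eff), "avg": sum(eff) / len(eff), "max": max(eff)}


if __name__ == "__main__":
    # Made-up numbers for illustration only, NOT measurements.
    fast_links = efficiency_stats([300.0, 450.0, 1400.0, 600.0], link_rate=1400.0)
    slow_links = efficiency_stats([700.0, 750.0, 800.0, 720.0], link_rate=900.0)
    print("faster links:", fast_links)
    print("slower links:", slow_links)
```

The spread between "min" and "max" is the width of the bandwidth distribution; "avg" is the aggregate efficiency.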

With static routing, you can find *one* set of routes that will provide full bandwidth between all pairs for a given set of pairs. If you change the set of pairs without changing the set of routes, then you will get much less than full bandwidth. On average, if you measure with enough random sets of pairs, you will get an aggregate efficiency of ~40% on several interconnects that use full bisection topologies (Clos or Fat Tree), a single virtual channel, wormhole switching and static routing. It has nothing to do with link rate; it is due to Head-of-Line (HOL) blocking: http://en.wikipedia.org/wiki/Head-of-line_blocking
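
To illustrate why one fixed set of routes cannot suit every set of pairs, here is a toy sketch under loudly stated assumptions: a two-level full-bisection fat tree with k leaf switches, k hosts per leaf and k spines; a hypothetical destination-based static route (destination d always goes through spine d mod k); and an equal split of a link among the flows sharing it. It models link sharing only, not wormhole switching or HOL blocking, so it is not expected to reproduce the ~40% figure:

```python
import random
from collections import Counter


def random_pairs(n, rng):
    """Split nodes 0..n-1 into random disjoint pairs (n should be even)."""
    nodes = list(range(n))
    rng.shuffle(nodes)
    return [(nodes[i], nodes[i + 1]) for i in range(0, n - 1, 2)]


def static_routing_efficiency(k, rng):
    """Return (min, avg, max) per-flow share for one random pairwise exchange.

    Assumed topology: host h sits on leaf h // k; every leaf has one
    uplink to each of the k spines. Assumed static route: traffic for
    destination d uses spine d % k, whatever the source.
    """
    flows = []
    for a, b in random_pairs(k * k, rng):
        flows += [(a, b), (b, a)]  # the exchange runs in both directions

    # Count flows per uplink (source leaf, spine). Under this route each
    # downlink (spine, destination leaf) serves exactly one host, and each
    # host receives one flow, so only uplinks can be shared here.
    uplink_load = Counter()
    for s, d in flows:
        if s // k != d // k:
            uplink_load[(s // k, d % k)] += 1

    shares = []
    for s, d in flows:
        if s // k == d // k:
            shares.append(1.0)  # same leaf: no uplink involved
        else:
            shares.append(1.0 / uplink_load[(s // k, d % k)])
    return min(shares), sum(shares) / len(shares), max(shares)


if __name__ == "__main__":
    rng = random.Random(0)
    trials = [static_routing_efficiency(8, rng) for _ in range(100)]
    print("worst pair share:", min(t[0] for t in trials))
    print("mean of averages:", sum(t[1] for t in trials) / len(trials))
```

Each trial keeps the routes fixed and changes only the set of pairs, which is the situation described above. An adaptive scheme would instead pick an uplink per packet based on load.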


here. I am not favor of static routing only or adaptive routing only,
and having both options is the most flexible solution.

It's not as simple as that. If you have a cluster that will run multiple jobs, most likely at the same time, which routing do you use? If you use static routing, efficiency may be good for one job and bad for another. Worse, the efficiency will change if I run the same job on different nodes, or depending on what other job is running at the same time on the cluster. If you use adaptive routing, efficiency will most likely be higher (maybe not by much) but, more importantly, it will be more deterministic. Determinism means less load imbalance, predictable time to completion, and higher job throughput.

So far, IB has only used static routing. If it still relies on packet order on the wire for a given Queue Pair (QP), then the only way to do some sort of adaptive routing is to use a different QP for each possible route (LID). This is what Panda's group tried in a paper. However, the number of QPs explodes, each QP is still subject to HOL blocking, and the QP interleaving is static.


You can see that worst-case static routing quickly drops below 40%, and the average eventually gets there as well.


So what is your proof point here? I am sure you will find many cases
that static routing will do better (definitely on other interconnects)
and cases for adaptive routing.

No, static routing is static routing, on all interconnects. There is no magic here: HOL blocking applies to everybody. My point is that under *random* structured patterns (such as pairwise exchange), static routing sucks. There are no other cases of random; it's just random.

If you want to argue that structured traffic patterns across multiple jobs running simultaneously on the same fabric are not equivalent to random structured traffic, then this will go nowhere.

Patrick
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
