Greg Lindahl wrote:
On Wed, Jun 28, 2006 at 07:28:53AM -0400, Patrick Geoffray wrote:

I have kept it quiet even when you were saying things driven by
marketing rather than technical considerations (the packet-per-second
nonsense),

Patrick, that "packet per second nonsense" is the technical reason our
interconnect does so well. If you'd like to argue about it,
technically, I'd be happy to do so. No need to keep quiet.

My reservation was about the way you present it, not about the technical idea behind it. Actually, my real concern was that there was no technical content in your post, just references to white papers, i.e. marketing fluff.

So, let's finally talk about the technical part. You claim that the key metric for your product is the messaging rate, i.e. the number of packets you can send per second. You even have a fancy name for it, something like Hyper Duper Messaging :-)

From the infamous white papers that I have seen, it looks like you can send 3 million packets per second from one process for small packets. I can send about 1 million packets per second on a D card with MX-1.2 (not the 560 K that your company claims in its marketing material). I could maybe double that if I worked on it. Anyway, you say that this is why your interconnect is so much better.

A high message rate is good, but the question is: how much is enough? At 3 million packets per second, that's 0.3 us per message, all of it spent in the communication library. Can you name real-world applications that need to send messages every 0.3 us in a sustained way? I can't; only benchmarks do that. At 1 million packets per second, that's one message per microsecond. When does the host actually compute anything? Did you measure the effective messaging rate of some applications? In your flawed white papers, you compare your own results against numbers picked from the web, measured on older interconnects with unknown software versions. Compared with Myrinet D cards, for example, you have 4 times the link bandwidth and half the latency (actually, more like 1/5 of the latency, because I suspect most or all of the Myrinet results were using GM and not MX), but you say that it's the messaging rate that drives the performance??? I would suspect the latency in most cases, but you certainly can't tell unless you look at it.
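
For reference, here is the kind of loop that actually sustains one message per microsecond or less. It is only a minimal sketch of a streaming message-rate test (the burst size, payload size and names are mine, not taken from anybody's benchmark suite), and nothing but a benchmark looks like this:

    /* Minimal streaming message-rate sketch: rank 1 pre-posts NMSG small
       receives, rank 0 injects NMSG small sends back-to-back, and we time
       the burst. Run with exactly 2 ranks. All constants are arbitrary. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NMSG  100000          /* messages in the burst */
    #define MSGSZ 8               /* "small" payload, in bytes */

    int main(int argc, char **argv)
    {
        int rank, size, i;
        char *buf = malloc((size_t)NMSG * MSGSZ);
        MPI_Request *reqs = malloc(NMSG * sizeof(MPI_Request));
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) { MPI_Finalize(); return 1; }

        if (rank == 1)            /* receiver: pre-post everything */
            for (i = 0; i < NMSG; i++)
                MPI_Irecv(buf + (size_t)i * MSGSZ, MSGSZ, MPI_BYTE,
                          0, 0, MPI_COMM_WORLD, &reqs[i]);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();

        if (rank == 0)            /* sender: inject the burst back-to-back */
            for (i = 0; i < NMSG; i++)
                MPI_Isend(buf + (size_t)i * MSGSZ, MSGSZ, MPI_BYTE,
                          1, 0, MPI_COMM_WORLD, &reqs[i]);

        MPI_Waitall(NMSG, reqs, MPI_STATUSES_IGNORE);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%.0f msg/s, %.3f us/msg\n",
                   NMSG / (t1 - t0), 1e6 * (t1 - t0) / NMSG);

        free(reqs);
        free(buf);
        MPI_Finalize();
        return 0;
    }

A real application has to do something useful between those sends, which is exactly my point.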

So, the only two cases where a high message rate makes some sense are:

* personalized all-to-all: an application may need to send small packets to many destinations after a computing phase. In that case, you want to send them as fast as possible, obviously. But this burst of communication is limited in size. The worst case is a naive personalized all-to-all, i.e. one message per peer (a rough sketch follows after these two points). How many messages is that, and how often? Does that add up to 3 million packets per second? I don't think so. You also have to receive from the peers, and the skew between processes is likely to cost more than the time to send all of your messages. My assertion is that 1 million packets per second is good enough; I have never seen this metric be a bottleneck in any application profiling I have done. Apparently, neither did the other interconnect vendors (I don't expect Mellanox to look at these things, but the Quadrics people are no beginners). With an interconnect doing real offload, you actually queue the messages and the NIC processes them. The message rate for a limited burst is then really the memory copy performance, since MPI_Send for a small/medium message returns just after the data has been copied out of the application send buffer. The time it takes the NIC to process the sends asynchronously is not zero, but it is usually not the bottleneck either, compared to the synchronization overhead of the all-to-all.

* high ratio of processes per NIC: that's actually the only good argument. If you cannot increase the total message rate when you add processes, then your message rate per process decreases. Again, your infamous marketing material is wrong: the bulk of the send overhead is not in the NIC. Actually, the latency numbers are wrong too (the cited sources are a talk from 2003, newsgroups and Pathscale estimations?!?). For various reasons (but not for any message-rate consideration), we have moved the reliability support from the NIC to the host, so the NIC overhead is now much lower than the host overhead, which is dominated by the memory copy for small messages. If you had access to the source of the MX-1.2 library, you would have seen it (you will when it is available under the widely known ftp.myri.com password; we always ship the source of our libraries, unlike some other vendors :-o ). So, the message rate does increase with the number of processes with MX-1.2, but it is still bounded by the NIC overhead, which is likely higher with Myrinet (2G or 10G) than with your NIC. However, the question is the same: how much is enough per process? Sharing one NIC among 8 cores is, IMHO, a bad idea. A ratio of one NIC per 2 cores was quite standard over the last decade. It may make economic sense to increase this ratio to one NIC per 4 cores, but I would not recommend going higher than that. And with the number of cores going up (if people actually buy many-core configurations; the sweet spot is definitely not at 8-way), it will make a lot of sense to use hybrid shared-memory/interconnect collective communications. In that context, the message-rate requirement of an all-to-all is not shared among processes.
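
To be concrete about the first point, here is roughly what I mean by a naive personalized all-to-all. It is only a sketch (the per-peer payload size and the function name are made up), but it shows that the burst is bounded by the number of peers:

    #include <mpi.h>
    #include <stdlib.h>

    #define MSGSZ 8   /* bytes per peer, arbitrary */

    /* sendbuf/recvbuf each hold MSGSZ bytes per peer, laid out contiguously */
    void naive_alltoall(char *sendbuf, char *recvbuf, MPI_Comm comm)
    {
        int np, peer, n = 0;
        MPI_Request *reqs;

        MPI_Comm_size(comm, &np);
        reqs = malloc(2 * np * sizeof(MPI_Request));

        for (peer = 0; peer < np; peer++)    /* pre-post one receive per peer */
            MPI_Irecv(recvbuf + peer * MSGSZ, MSGSZ, MPI_BYTE,
                      peer, 0, comm, &reqs[n++]);

        for (peer = 0; peer < np; peer++)    /* the burst: one send per peer */
            MPI_Isend(sendbuf + peer * MSGSZ, MSGSZ, MPI_BYTE,
                      peer, 0, comm, &reqs[n++]);

        /* the burst ends after np sends; the time spent here is dominated
           by the skew between processes, not by how fast the sends went out */
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }

Even with 1024 peers that burst is only 1024 sends; at 1 million packets per second it is injected in about a millisecond, which is small next to the synchronization cost.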

Finally, you don't talk much about the side effects of your architectural decisions, such as little or no overlap and high CPU overhead.
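
By overlap I mean something as simple as the pattern below. This is just an illustration (do_work and the arguments are placeholders), but with a NIC doing real offload most of do_work() hides the transfer, whereas an onload design burns the host CPU inside the MPI calls instead:

    #include <mpi.h>

    /* Post the send, compute while the data moves, then wait. */
    void send_with_overlap(double *buf, int count, int peer,
                           void (*do_work)(void))
    {
        MPI_Request req;

        MPI_Isend(buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
        do_work();                 /* useful work overlapped with the transfer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }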

The white papers are right about one thing: latency and bandwidth are not enough to fully describe an interconnect. But message rate is just one of the metrics, and I assert that it's not a particularly important one. I suspect that Pathscale picked message rate as its marketing drum because no other interconnect vendor really cared about it. That was the differentiation bullet from the business workshop I attended.

Patrick

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
