Greg Lindahl wrote:
On Wed, Jun 28, 2006 at 07:28:53AM -0400, Patrick Geoffray wrote:

I have kept it quiet even when you were saying things driven by
marketing rather than technical considerations (the packet-per-second
nonsense),

Patrick, that "packet per second nonsense" is the technical reason our
interconnect does so well. If you'd like to argue about it,
technically, I'd be happy to do so. No need to keep quiet.

My reservation was about the way you present it, not about the technical idea behind it. Actually, my real concern was that there was no technical content in your post, just references to white papers, i.e. marketing fluff.

So, let's finally talk about the technical part. You claim that the key metric for your product is the messaging rate, i.e. the number of packets you can send per second. You even have a fancy name for it, something like Hyper Duper Messaging :-)

From the infamous white papers that I have seen, it looks like you can send 3 million packets per second from one process for small packets. I can send about 1 million packets per second on a D card with MX-1.2 (not the 560 K that your company claims in its marketing material). I could maybe double that if I worked on it. Anyway, you say that this is why your interconnect is so much better.

A high message rate is good, but the question is: how much is enough? At 3 million packets per second, that's 0.3 us per message, all of it spent in the communication library. Can you name real-world applications that need to send messages every 0.3 us in a sustained way? I can't; only benchmarks do that. At 1 million packets per second, that's one message per microsecond. When does the host actually compute anything? Did you measure the effective messaging rate of some applications? In your flawed white papers, you compare your own results against numbers picked from the web, measured on older interconnects with unknown software versions. Compared with Myrinet D cards, for example, you have 4 times the link bandwidth and half the latency (actually, more like 1/5 of the latency, because I suspect most or all of the Myrinet results were using GM and not MX), but you say that it's the messaging rate that drives the performance??? I would suspect the latency in most cases, but you certainly can't tell unless you look at it.
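
For reference, here is the kind of loop that actually sustains one message per microsecond or less. It is only a minimal sketch of a streaming message-rate test (the burst size, payload size and names are mine, not taken from anybody's benchmark suite), and nothing but a benchmark looks like this:

    /* Minimal streaming message-rate sketch: rank 1 pre-posts NMSG small
       receives, rank 0 injects NMSG small sends back-to-back, and we time
       the burst. Run with exactly 2 ranks. All constants are arbitrary. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NMSG  100000          /* messages in the burst */
    #define MSGSZ 8               /* "small" payload, in bytes */

    int main(int argc, char **argv)
    {
        int rank, size, i;
        char *buf = malloc((size_t)NMSG * MSGSZ);
        MPI_Request *reqs = malloc(NMSG * sizeof(MPI_Request));
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) { MPI_Finalize(); return 1; }

        if (rank == 1)            /* receiver: pre-post everything */
            for (i = 0; i < NMSG; i++)
                MPI_Irecv(buf + (size_t)i * MSGSZ, MSGSZ, MPI_BYTE,
                          0, 0, MPI_COMM_WORLD, &reqs[i]);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();

        if (rank == 0)            /* sender: inject the burst back-to-back */
            for (i = 0; i < NMSG; i++)
                MPI_Isend(buf + (size_t)i * MSGSZ, MSGSZ, MPI_BYTE,
                          1, 0, MPI_COMM_WORLD, &reqs[i]);

        MPI_Waitall(NMSG, reqs, MPI_STATUSES_IGNORE);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%.0f msg/s, %.3f us/msg\n",
                   NMSG / (t1 - t0), 1e6 * (t1 - t0) / NMSG);

        free(reqs);
        free(buf);
        MPI_Finalize();
        return 0;
    }

A real application has to do something useful between those sends, which is exactly my point.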

So, the only two cases where a high message rate makes some sense are:

* personalized all-to-all: an application may need to send small packets to many destinations after a computing phase. In that case, you want to send them as fast as possible, obviously. But this burst of communication is limited in size. The worst case is a naive personalized all-to-all, i.e. one message per peer (a rough sketch follows after these two points). How many messages is that, and how often? Does that add up to 3 million packets per second? I don't think so. You also have to receive from the peers, and the skew between processes is likely to cost more than the time to send all of your messages. My assertion is that 1 million packets per second is good enough; I have never seen this metric be a bottleneck in any application profiling I have done. Apparently, neither did the other interconnect vendors (I don't expect Mellanox to look at these things, but the Quadrics people are no beginners). With an interconnect doing real offload, you actually queue the messages and the NIC processes them. The message rate for a limited burst is then really the memory copy performance, since MPI_Send for a small/medium message returns just after the data has been copied out of the application send buffer. The time it takes the NIC to process the sends asynchronously is not zero, but it is usually not the bottleneck either, compared to the synchronization overhead of the all-to-all.

* high ratio of processes per NIC: that's actually the only good argument. If you cannot increase the total message rate when you add processes, then your message rate per process decreases. Again, your infamous marketing material is wrong: the bulk of the send overhead is not in the NIC. Actually, the latency numbers are wrong too (the cited sources are a talk from 2003, newsgroups and Pathscale estimations?!?). For various reasons (but not for any message-rate consideration), we have moved the reliability support from the NIC to the host, so the NIC overhead is now much lower than the host overhead, which is dominated by the memory copy for small messages. If you had access to the source of the MX-1.2 library, you would have seen it (you will when it is available under the widely known ftp.myri.com password; we always ship the source of our libraries, unlike some other vendors :-o ). So, the message rate does increase with the number of processes with MX-1.2, but it is still bounded by the NIC overhead, which is likely higher with Myrinet (2G or 10G) than with your NIC. However, the question is the same: how much is enough per process? Sharing one NIC among 8 cores is, IMHO, a bad idea. A ratio of one NIC per 2 cores was quite standard over the last decade. It may make economic sense to increase this ratio to one NIC per 4 cores, but I would not recommend going higher than that. And with the number of cores going up (if people actually buy many-core configurations; the sweet spot is definitely not at 8-way), it will make a lot of sense to use hybrid shared-memory/interconnect collective communications. In that context, the message-rate requirement of an all-to-all is not shared among processes.
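
To be concrete about the first point, here is roughly what I mean by a naive personalized all-to-all. It is only a sketch (the per-peer payload size and the function name are made up), but it shows that the burst is bounded by the number of peers:

    #include <mpi.h>
    #include <stdlib.h>

    #define MSGSZ 8   /* bytes per peer, arbitrary */

    /* sendbuf/recvbuf each hold MSGSZ bytes per peer, laid out contiguously */
    void naive_alltoall(char *sendbuf, char *recvbuf, MPI_Comm comm)
    {
        int np, peer, n = 0;
        MPI_Request *reqs;

        MPI_Comm_size(comm, &np);
        reqs = malloc(2 * np * sizeof(MPI_Request));

        for (peer = 0; peer < np; peer++)    /* pre-post one receive per peer */
            MPI_Irecv(recvbuf + peer * MSGSZ, MSGSZ, MPI_BYTE,
                      peer, 0, comm, &reqs[n++]);

        for (peer = 0; peer < np; peer++)    /* the burst: one send per peer */
            MPI_Isend(sendbuf + peer * MSGSZ, MSGSZ, MPI_BYTE,
                      peer, 0, comm, &reqs[n++]);

        /* the burst ends after np sends; the time spent here is dominated
           by the skew between processes, not by how fast the sends went out */
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }

Even with 1024 peers that burst is only 1024 sends; at 1 million packets per second it is injected in about a millisecond, which is small next to the synchronization cost.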

Finally, you don't talk much about the side effects of your architectural decisions, such as little or no overlap and high CPU overhead.
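
By overlap I mean something as simple as the pattern below. This is just an illustration (do_work and the arguments are placeholders), but with a NIC doing real offload most of do_work() hides the transfer, whereas an onload design burns the host CPU inside the MPI calls instead:

    #include <mpi.h>

    /* Post the send, compute while the data moves, then wait. */
    void send_with_overlap(double *buf, int count, int peer,
                           void (*do_work)(void))
    {
        MPI_Request req;

        MPI_Isend(buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
        do_work();                 /* useful work overlapped with the transfer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }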

The white papers are right about one thing: latency and bandwidth are not enough to fully describe an interconnect. But message rate is just one of the metrics, and I assert that it's not a particularly important one. I suspect that Pathscale picked message rate as its marketing drum because no other interconnect vendor really cared about it. That was the differentiation bullet from the business workshop I attended.

Patrick

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
