On Wed, 28 Jun 2006, Patrick Geoffray wrote:

> High message rate is good, but the question is: how much is enough?
> At 3 million packets per second, that's 0.3 us per message, all of it
> spent in the communication library. Can you name real-world
> applications that need to send messages every 0.3 us in a sustained
> way? I can't; only benchmarks do that. At 1 million packets per
> second, that's one message per microsecond. When does the host
> actually compute something? Did you measure the effective messaging
> rates of some applications?
>
> In your flawed white papers, you compared your own results against
> numbers picked from the web, using older interconnects with unknown
> software versions. Comparing with Myrinet D cards, for example, you
> have 4 times the link bandwidth and half the latency (actually, more
> like 1/5 of the latency, because I suspect most/all Myrinet results
> were using GM and not MX), but you say that it's the messaging rate
> that drives the performance??? I would suspect the latency in most
> cases, but you certainly can't say unless you look at it.
Hi Patrick --

I agree with you that the inverse of message rate, or the
small-message gap in logP-derived models, is a more useful way to view
the metric. How much more important it is than latency depends on the
relative difference between your gap and your latency. One can easily
construct a collective model using logP parameters to estimate
expected performance (a back-of-the-envelope sketch follows below).
When the latency is low enough, a gap of 0.3 versus 1.0 makes a
difference, even in collectives that complete in a logarithmic number
of steps. The fact that a process need not send 1 million messages per
second is beside the point: in more than just the all-to-all case you
cite, the per-message cost can determine the amount of time you spend
in many MPI communication operations.

> * high ratio of processes per NIC: that's actually the only good
> argument. If you cannot increase the total message rate when you add
> processes, then your message rate per process decreases. Again, your
> infamous marketing material is wrong: the bulk of the send overhead
> is not in the NIC. Actually, the latency is wrong too (sources say a
> talk from 2003, newsgroups and Pathscale estimations ?!?). For
> various reasons (but not for any message-rate consideration), we
> have moved the reliability support from the NIC to the host, so the
> NIC overhead is now much lower than the host overhead, which is
> dominated by the memory copy for small messages. If you had access
> to the source of the MX-1.2 library, you would have seen it (you
> will when it is available under the widely known ftp.myri.com
> password; we always ship the source of our libraries, not like other
> vendors :-o ).
>
> So, the message rate does increase with the number of processes with
> MX-1.2, but it will still be bounded by the NIC overhead, which is
> likely more important with Myrinet (2G or 10G) than with your NIC.
> However, this is the same question: how much is enough per process?
> Sharing one NIC among 8 cores is, IMHO, a bad idea. A ratio of one
> NIC per 2 cores was quite standard over the last decade. It may make
> economic sense to increase this ratio to one NIC per 4 cores, but I
> would not recommend going higher than that. And with the number of
> cores going up (if people actually buy many-core configurations; the
> sweet spot is definitely not at 8-way), it will make a lot of sense
> to use hybrid shared-memory/interconnect schemes for collective
> communications. In this context, the message-rate requirement of an
> all-to-all is not shared among processes.

I'm not ready to put my stake in the ground predicting how many cores
will drive each NIC in the near future. The past decade didn't have
multi-core, and upcoming price/performance points may warrant putting
more cores on each node. Sure, a single core will always be able to
exploit the full bandwidth of a single NIC. However, strong scaling
and some of the more advanced codes that don't always operate at peak
bandwidth can leave enough headroom in available bandwidth for other
cores to use. Even if 4 or 8 cores oversubscribe a single NIC, why not
use the cores if it so happens that the communication patterns and
message sizes still allow you to improve your time-to-solution? After
all, time-to-solution is what it's all about. Sure, a second NIC will
always help, but getting the best performance out of a single NIC and
maintaining scalable message rates as the number of per-node cores
increases is a useful metric.
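To put rough numbers on the gap-versus-latency point above, here is
the kind of back-of-the-envelope logP estimate I mean. The L and o
values are made up for illustration (they're not measurements of
anyone's interconnect); the broadcast is crudely modeled as
ceil(log2(P)) binomial-tree rounds, each costing max(o,g) + L + o, and
the all-to-all as P-1 back-to-back injections paced by max(o,g):

/* Crude LogP-style estimates.  L (wire latency), o (per-message CPU
 * overhead) and the two g (gap) values are illustrative assumptions,
 * not measurements of any interconnect. */
#include <stdio.h>
#include <math.h>

static double max2(double a, double b) { return a > b ? a : b; }

/* Binomial-tree broadcast: ceil(log2(P)) rounds; one simplification
 * is that each round costs an injection (max(o,g)), the wire latency
 * L, and the receive overhead o. */
static double bcast_us(int P, double L, double o, double g)
{
    return ceil(log2((double)P)) * (max2(o, g) + L + o);
}

/* Naive all-to-all: each process injects P-1 back-to-back small
 * messages, paced by max(o,g); the last one still pays the latency. */
static double alltoall_us(int P, double L, double o, double g)
{
    return (double)(P - 1) * max2(o, g) + L;
}

int main(void)
{
    const double L = 1.5, o = 0.3;       /* microseconds, assumed   */
    const double gaps[] = { 0.3, 1.0 };  /* the two gaps in debate  */

    for (int i = 0; i < 2; i++) {
        printf("gap = %.1f us:\n", gaps[i]);
        for (int P = 16; P <= 1024; P *= 4)
            printf("  P = %4d   bcast ~ %6.1f us   all-to-all ~ %7.1f us\n",
                   P, bcast_us(P, L, o, gaps[i]),
                   alltoall_us(P, L, o, gaps[i]));
    }
    return 0;
}

With these assumed numbers, going from a 0.3 us gap to a 1.0 us gap
roughly triples the all-to-all estimate at P=1024 and still adds about
a third to the log-step broadcast -- a difference that latency and
bandwidth alone would never predict.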
> Finally, you don't talk much about the side effects of your
> architectural decisions, such as little/no overlap and high CPU
> overhead.

We can have a discussion on the correlation between programming models
and their impact on architecture. Unfortunately, there's not much to
say here in terms of side effects relative to everyone's favorite
programming model -- MPI-1. While it's undeniable that judicious use
of non-blocking operations on networks with offload engines can lead
to better effective performance, how this capability correlates to the
applications people are actually writing is the real question. What's
unclear with this type of overlap is the performance portability you
get from using the more advanced MPI communication techniques: the
amount of communication/computation overlap you can actually obtain
varies from vendor to vendor. One way to fix this at the application
level is to make the computation adaptive to the amount of
communication that can be overlapped (see the sketch at the end of
this message). This is feasible, but it is tricky and is often written
only by MPI communication experts. Plus, there's the fact that it's in
every vendor's interest to optimize the more basic (but less exciting)
MPI communication primitives to support all the existing codes.

Life with MPI-2 wouldn't solve the problem either. Most vendors choose
to expose their offload engines through a generally usable RDMA
interface and have to face the fact that the MPI-2 passive/active
model imposes a semantic mismatch and added synchronization. The point
of one-sided communication is to remove much of the implied
synchronization you get with MPI-1 and to allow applications with low
synchronization requirements to benefit from pure data transfers. An
architecture that enables overlap through RDMA mechanisms can suit
these applications very well, but the remaining problem seems to be
lining up an RMA standard that users can understand and architectures
can implement at low added cost. Even with MPI-1, much of the RDMA
semantics have to be retrofitted to implement MPI's matched, ordered
envelope model -- you already know this, and much research (with still
much more to come!) has gone into optimizing this retrofit. What MPI
needs is an RDMA mode so people can fully exploit their hardware for
the characteristics it has. In the meantime, people should visit other
programming models that fit RDMA more tightly, like global address
space languages. If those won't do, stick to the performant, portable
MPI-1 communication operations.

> The white papers are right on one thing: latency and bandwidth are
> not enough to fully describe an interconnect. But message rate is
> just one of the metrics, and I assert that it's not a particularly
> important one. I suspect that Pathscale picked message rate as a
> marketing drum because no other interconnect really cared about it.
> That was the differentiation bullet from the business workshop I
> attended.

If you believe that logP-derived models can be useful for predicting
some areas of interconnect performance, then message rate (or
small-message gap) is simply the missing parameter to the model once
one has latency and bandwidth. Of course, I can't be confident that it
adequately measures performance for all cluster sizes, message sizes
and communication patterns, but it is not just a futile marketing
metric.
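For what it's worth, the overlap pattern I referred to above looks
roughly like the following. It's only a sketch: the buffer size N, the
chunk count NCHUNKS, the peer rank, and do_chunk_of_work() are
placeholders an application would fill in, and how much data actually
moves during the compute loop is exactly what varies from vendor to
vendor:

/* Sketch of overlapping computation with non-blocking MPI-1
 * transfers.  N, NCHUNKS and do_chunk_of_work() are application
 * placeholders. */
#include <mpi.h>

#define N       (1 << 16)
#define NCHUNKS 32

extern void do_chunk_of_work(int chunk);   /* application compute */

void exchange_with_overlap(double *sendbuf, double *recvbuf,
                           int peer, MPI_Comm comm)
{
    MPI_Request reqs[2];
    int flag;

    /* Post both transfers up front so the network can work on them
     * while the host computes. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    for (int chunk = 0; chunk < NCHUNKS; chunk++) {
        do_chunk_of_work(chunk);
        /* Periodic test: nearly free on NICs that progress transfers
         * in hardware, but it lets host-based stacks move data too. */
        MPI_Testall(2, reqs, &flag, MPI_STATUSES_IGNORE);
    }
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

On a host-based stack, much of the small-message copying may still
happen inside MPI_Waitall, which is why the achievable overlap -- and
the payoff of writing code this way -- is so architecture-dependent.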
cheers,

--
Christian Bell
[EMAIL PROTECTED]

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf