On Thu, 12 Apr 2007, Håkon Bugge wrote:

> The dataset is fixed, elapsed time includes
> initialization, write of animation files and
> more. Hence, slower per node performance would
> _scale_ better.
My comparison and measured scalability are based on each node's speedup relative to its own 2p performance. Both show the same *relative* speedup until 32p, where one of the two configurations no longer matches the other in relative scalability.

> what I have shown is that an RDMA interconnect
> performs faster than a message passing
> interconnect which has roughly 3x lower latency
> and 20x (?) higher message rate upto a scaling
> point where the RDMA _implementation_ collapses.

I don't know about the 3x/20x numbers. I can tell you that in the ls-dyna message profiles I've looked at (for 4p to 32p), the application is dominated by large messages with the neon_reference dataset, so latency and per-message overhead are unlikely to be important performance determinants.

> And this _despite_ the fact the RDMA based MPI
> has to perform the MPI message matching.

I wouldn't overstate the cost of the matching. The fact that an MPI implementation employs RDMA to send MPI envelopes makes the matching cost apparent in that implementation, but every MPI implementation has to pay the non-zero cost of message matching somewhere.

> I doubt you're missing anything;-) But let me
> stress that as the number of cores per node
> scale, a message passing semantics HCA with
> message matching in the HCA will have a constant
> message matching rate. An RDMA based MPI which
> uses the cores for message matching, the message
> matching rate would be almost proportional to the number of
> cores...

Your point raises a few interesting questions, but I'd contribute further by separating interface from implementation. Since RDMA is really a pure form of low-level data movement with very little implied control, there is no specification of how to do the message matching. With what most people agree to call RDMA, the matching has to be done as a separate operation once the data movement has happened.
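To make concrete the point that every MPI implementation pays the matching cost somewhere, here is a minimal sketch of the two-queue matching logic that has to live either on the host cores or in the NIC. This is an illustration only, not code from any real MPI implementation; the names (Envelope, MatchEngine, ANY_SOURCE) are mine, and real envelopes carry more state than this.

```python
# Hedged sketch of MPI-style message matching: a posted-receive queue and
# an unexpected-message queue, matched on (source, tag, communicator).
# All names here are illustrative stand-ins, not a real MPI API.
from collections import deque
from dataclasses import dataclass

ANY_SOURCE = -1   # stands in for MPI_ANY_SOURCE
ANY_TAG = -1      # stands in for MPI_ANY_TAG

@dataclass
class Envelope:
    source: int
    tag: int
    comm: int       # communicator context id
    payload: bytes

def _matches(posted_src, posted_tag, posted_comm, env):
    """A posted receive matches an envelope on comm, source, and tag,
    with wildcards allowed for source and tag."""
    return (posted_comm == env.comm
            and posted_src in (ANY_SOURCE, env.source)
            and posted_tag in (ANY_TAG, env.tag))

class MatchEngine:
    def __init__(self):
        self.posted = deque()       # receives posted before data arrived
        self.unexpected = deque()   # arrivals with no matching receive yet

    def arrive(self, env):
        """An envelope arrives from the wire (or from shared memory).
        Search the posted-receive queue in order; if nothing matches,
        park it on the unexpected queue for a later receive to find."""
        for i, (src, tag, comm, deliver) in enumerate(self.posted):
            if _matches(src, tag, comm, env):
                del self.posted[i]
                deliver(env)
                return
        self.unexpected.append(env)

    def post_recv(self, src, tag, comm, deliver):
        """A receive is posted: first search the unexpected queue,
        then enqueue on the posted-receive queue."""
        for i, env in enumerate(self.unexpected):
            if _matches(src, tag, comm, env):
                del self.unexpected[i]
                deliver(env)
                return
        self.posted.append((src, tag, comm, deliver))
```

Note that a receive posted with ANY_SOURCE can match an envelope from *any* path into the node, which is exactly why such wildcards force the shared-memory and interconnect arrival paths through one ordered match list.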
A message matching interface implies elements of both data movement and control, and "matching in the NIC" is just one implementation of those control operations. However, because a message matching interface is broader in its specification, the message matching can happen on either side of the PCI bus. To my knowledge, only a fraction of interconnects with "message matching" APIs do the matching on the I/O side of the PCI bus. I'd be interested in hearing their take on pursuing matching in the NIC in the face of an increasing number of cores per node.

Matching in the NIC can be extremely painful to implement: memory constraints for potentially long match lists (although those long lists are rare), the fact that MPI_ANY_SOURCE turns the match lists into serialization points between shared-memory and interconnect communication (more complexity and synchronization over the PCI bus), etc. I would have said that matching in the NIC was a clear win a few years ago, but now that processing cores are a-plenty and the NIC has become a serialization point for more of these cores, the design space has changed considerably.

. . christian

--
[EMAIL PROTECTED]
(QLogic SIG, formerly Pathscale)

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf