On Fri, 23 Mar 2007, Gilad Shainer wrote:

> Are you selling Myricom HW or Qlogic HW?
Based on what I know, I think it's perfectly reasonable for Patrick to expect that one messaging technology can outdo another for reasons other than higher signaling rates.

> In general, application performance depends on the interconnect
> architecture and not only on pure latency or pure bandwidth.
> Qlogic till recently (*) had the lowest latency number but when it
> comes to application, the CPU overhead is too high. Check some

High CPU overhead, as opposed to the lower overhead one expects from offloading, is a very crude characterisation of interconnect performance. Plus, if I remember correctly, the claim isn't even true for many important message sizes and communication patterns.

Offload, usually implemented as RDMA offload (the ability for a NIC to autonomously send and/or receive data to/from memory), is certainly a nice feature to tout. If one considers RDMA at the interface level (without looking at the registration calls required on some interconnects), it's the purest and most flexible form of interconnect data transfer. Unfortunately, this pure form of data transfer has a few caveats.

How well the programming model matches up with the semantics of RDMA is the real question. A quick sampling suggests that global-address-space languages fit squarely on top of RDMA, whereas MPI-2 almost does if one sets aside some of its windowing complexity (the first sketch below shows what that complexity looks like). MPI-1, the most popular model out there, has the least in common with RDMA offload. In its simplest form in MPI implementations, RDMA handles half of the communication protocol involved in large messages. In its more elaborate form, it can be used to handle small to medium-sized messages, as shown by a few openib/iwarp MPI implementations (although these really implement a complex assortment of hybrid RDMA and non-RDMA mechanisms to provide scalable performance).

RDMA offload, depending on the complexity of its implementation, can buy you anywhere from a little to a lot of communication offload ("total" communication offload in Quadrics' case). But RDMA implementations aside, you can only offload what the programming model *and* the programmer will let you. Programmers must understand the data dependencies in their codes and know where and how to separate communication initiation and completion points (the second sketch below shows the MPI-1 version of this). Even well-intentioned programmers can fail to expose their apps to communication offload -- complex legacy apps can be intimidating to modify, some apps have strong data dependencies, and others are dominated by collectives, which are themselves indivisible (i.e. blocking). And finally, a programmer who successfully overcomes all these hurdles cannot expect to be provided with an equal level of overlap on all interconnects. There's a good reason that many programmers continue to find refuge in simple offload-less primitives like Send/Recv: the expectation that it's in the interest of every MPI and interconnect vendor to provide the best Send/Recv possible.

Many competent programmers will reap definite benefits from highly specialized implementations of RDMA offload. But then again, these programmers will also know how to analyse their applications and may come to completely different conclusions. For example, they may realise that most of their codes cannot fully benefit from offload, and that the best choice is simply the interconnect that spends the least time in the MPI primitives they actually lean on -- hardware-assisted operations, pt-to-pt midsize message performance, consistent cluster-wide message latency, etc.
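Since I invoked MPI-2 above: this is roughly what its one-sided interface asks of the programmer before a single RDMA-friendly byte moves. A minimal sketch using only standard MPI-2 calls (error handling omitted, needs at least two ranks); the window creation and fence synchronization are exactly the windowing complexity I mean:

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI-2 one-sided example: every rank exposes a local
     * integer through a window, and rank 0 writes a value directly
     * into rank 1's memory with MPI_Put.  Rank 1 takes no part in
     * the transfer itself, which is what maps onto RDMA; the setup
     * and fences around it are what raw RDMA doesn't show. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose 'value' to one-sided access by all ranks. */
        MPI_Win_create(&value, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                /* open access epoch */
        if (rank == 0) {
            int payload = 42;
            MPI_Put(&payload, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);        /* close epoch, data visible */

        if (rank == 1)
            printf("rank 1 received %d\n", value);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }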
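And to put "separating initiation and completion points" in concrete MPI-1 terms, here is a sketch of a halo exchange (the function and buffer names are mine, purely illustrative, and it's a fragment rather than a whole program). The programmer does everything right here, yet whether the transfer actually makes progress during the compute phase still depends on the NIC and the MPI library underneath:

    #include <mpi.h>

    /* Communication/computation overlap with MPI-1 nonblocking
     * point-to-point.  The programmer promises not to touch the
     * buffers between initiation and completion; whether the
     * interconnect overlaps the transfer with the independent work
     * is another matter entirely. */
    void halo_exchange(double *sendbuf, double *recvbuf, int n,
                       int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];

        /* Initiation: post the communication up front. */
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);

        /* ... work that does not depend on recvbuf goes here ... */

        /* Completion: only now do we require the data. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }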
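For contrast, the pt-to-pt latency figure that fuels these threads typically comes from nothing fancier than a ping-pong loop. A generic sketch (not any particular vendor's benchmark, and again two ranks assumed):

    #include <mpi.h>
    #include <stdio.h>

    /* Generic 0-byte ping-pong: half the round-trip time is the
     * "latency" that ends up on slides.  It says nothing about CPU
     * overhead, overlap, or behaviour under load. */
    int main(int argc, char **argv)
    {
        enum { ITERS = 10000 };
        int rank, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("half round-trip: %.2f us\n",
                   (t1 - t0) / ITERS / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }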
Understanding the expected performance of specific communication primitives is an application-centric view of performance evaluation. Assuming that more cores necessarily require fatter pipes, or leaning on pt-to-pt latency measurements, signaling rates, messaging rates and so on, are all microbenchmark-centric views of interconnect evaluation. Picking on the latter is just too simplistic and rarely translates into a general and verifiable view of the world, but it's good fodder for oneupmanship and insipid (but entertaining) inter-vendor bickering.

RDMA offload is attractive for many other reasons, but in the context of today's most popular programming model it isn't as vital as one would like. It's reasonable conventional wisdom that offload is a desirable feature, but given the way programming models have been moving (i.e. not moving), interconnects that do not offer elaborate communication offload mechanisms are not at a loss, far from it.

Efficiently exploiting a low-level RDMA engine for the purposes of message passing would mean enabling its pure data transfer capability to percolate through the many levels of software stack and programming-model semantics mostly unscathed. That is an unrealistic expectation. I've yet to see a significant number of message-passing applications show that an RDMA offload engine, as opposed to any other messaging engine, is a stronger performance determinant. That's probably because there are other equally important and desirable features implemented in other messaging engines.

cheers,
. . christian

--
[EMAIL PROTECTED] (QLogic SIG, formerly Pathscale)