On Jan 30, 2009, at 21:24, Vincent Diepeveen wrote:
Now that you're busy with all this, mind quoting the interconnect's
switch latency? For example: if one of the cores on our box, c0, is
busy receiving a long message from a remote node in the network, a
message that will take significant time, can the interconnect let a
short message meant for c1 through in the meantime, and if so, what
latency does it take for c1 to receive it?
Vincent,
This is a good question. Multiple DMA engines/threads in the HCA and/or
different priority levels for the DMAs are the relevant issues here. I
would claim, as usual in the HW vs. SW world, that the mechanisms are
implemented in the hardware, but the ability of the software to take
advantage of them may not be there.
Since small messages often are required to get through in order to
start a large one, e.g. in rendezvous protocols, your example is relevant.
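To make your example concrete, here is a minimal micro-benchmark sketch
of my own (plain MPI C; the rank placement, the 256 MB message size and
the iteration count are all assumptions, and a serious measurement would
also pin processes and warm up the buffers). Ranks 0 and 1 are assumed
to sit on the same node (your c0 and c1) and ranks 2 and 3 on a remote
node; rank 1 ping-pongs 8-byte messages with rank 3, first on an idle
fabric and then while rank 0 is receiving a large message from rank 2.

/* Sketch only: does a long message arriving for one core delay short
 * messages destined for a neighbouring core on the same node?
 * Assumed placement (via your hostfile/mpirun options):
 *   ranks 0,1 on node A ("c0"/"c1"), ranks 2,3 on a remote node B.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BIG   (256 * 1024 * 1024)   /* bulk transfer size, bytes (assumption) */
#define SMALL 8                     /* short-message payload, bytes           */
#define ITERS 1000

/* Mean round-trip time of a SMALL-byte ping-pong with 'peer'. */
static double pingpong(int peer, int initiator, char *buf)
{
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (initiator) {
            MPI_Send(buf, SMALL, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, SMALL, MPI_CHAR, peer, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, SMALL, MPI_CHAR, peer, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SMALL, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
        }
    }
    return (MPI_Wtime() - t0) / ITERS;
}

int main(int argc, char **argv)
{
    int rank;
    char small[SMALL] = {0}, *big = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 || rank == 2)
        big = malloc(BIG);

    /* Phase 1: baseline short-message latency on an otherwise idle fabric. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1)
        printf("baseline RTT: %.2f us\n", 1e6 * pingpong(3, 1, small));
    else if (rank == 3)
        pingpong(1, 0, small);

    /* Phase 2: same ping-pong while rank 2 streams a big message to rank 0. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        MPI_Recv(big, BIG, MPI_CHAR, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    else if (rank == 2)
        MPI_Send(big, BIG, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        printf("RTT with bulk transfer in flight: %.2f us\n",
               1e6 * pingpong(3, 1, small));
    else if (rank == 3)
        pingpong(1, 0, small);

    free(big);
    MPI_Finalize();
    return 0;
}

If the HCA and the MPI stack can interleave the two streams, the second
round-trip time should stay close to the baseline; if the short messages
queue up behind the bulk transfer, it will grow towards the duration of
the large receive.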
You might be interested in looking at the SPEC MPI2007 results at http://www.spec.org/mpi2007/results/mpi2007.html
There you will find different MPI implementations using identical or
very similar hardware, and/or different interconnects. Across the 13
applications constituting the SPEC MPI2007 medium suite, you will find
that the mileage varies significantly. This may well be related to how
the software responds to the issue you raise.
On Jan 30, 2009, at 6:06 PM, Greg Lindahl wrote:
Even LogP doesn't describe an interconnect that well. It matters how
efficient your interconnect is at dealing with multiple cores, and the
number of nodes. As an example of that, MPI implementations for
InfiniBand generally switch over to higher latency/higher overhead
mechanisms as the number of nodes in a cluster rises, because the
lowest latency mechanism at 2 nodes doesn't scale well.
[slight change of subject]
Greg, we (and your former colleagues at PathScale) have exchanged
opinions on RDMA vs. Message Passing. Based on SPEC MPI2007, you will
find that an RDMA-based DDR interconnect using Platform MPI performs
better than a Message Passing stack using DDR. Looking at 16 nodes,
128 cores, Intel E5472, the Message Passing paradigm is faster on 5
(out of 13) applications, whereas Platform MPI with its RDMA paradigm
is faster on 8 of the applications. Further, when the Message Passing
paradigm is faster, it is never more than 7% faster (on pop2). On the
other hand, when Platform MPI with its RDMA paradigm is faster, we are
talking 33, 18, and 17%.
Maybe you will call it a single data point. But I will respond that it
is 13 applications. And frankly, I didn't have more gear to run on ;-)
Thanks, Håkon
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf