The original question was about relatively small messages - only 500 doubles (4,000 bytes) each.

You can often get better throughput if you send, say, two smaller messages rather
than one large one, because the interconnect can then issue multiple RDMA requests
that proceed concurrently.

This old paper from 2003 illustrates this:
http://www.docstoc.com/docs/5579957/Quadrics-QsNetII-A-network-for-Supercomputing-Applications
Page 25 shows a graph where 1, 2, 4 and 8 RDMA transfers are issued
concurrently. For large messages (>256 KB) there is no significant difference in
the achieved total bandwidth - it is limited by the PCIe/PCI-X interface or the
interconnect fabric itself.
But at smaller message sizes there are measurable differences - e.g. two 1 KB
messages achieve higher total bandwidth than a single 2 KB message.
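
As a rough sketch of the idea (not the original poster's code): split one
2000-double transfer between ranks 0 and 1 into two nonblocking halves so the
interconnect can overlap them. The buffer names, the 1000-double split, and the
assumption that rank comes from MPI_Comm_rank are mine:

double sendbuf[2000], recvbuf[2000];
MPI_Request req[2];

if (rank == 0) {
    /* two concurrent halves instead of one 2000-double MPI_Send */
    MPI_Isend(&sendbuf[0],    1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&sendbuf[1000], 1000, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
} else if (rank == 1) {
    MPI_Irecv(&recvbuf[0],    1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&recvbuf[1000], 1000, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}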

Daniel

p.s. did you really mean to compare three 500-element transfers with a single
2000-element transfer, rather than the same total message size in both cases?

p.p.s. Case A is really a broadcast - interconnects that implement broadcast in
hardware are bound to do A faster than B.
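
If all participating ranks know the message size, case A could even be written
as a collective and left to the MPI library (and hardware) to optimise. A
minimal sketch, assuming four ranks and a 500-double buffer called buf on each
of them (the buffer name is my assumption, not the original poster's):

/* rank 0's buf is the source; ranks 1-3 receive the same 500 doubles into theirs */
MPI_Bcast(buf, 500, MPI_DOUBLE, 0, MPI_COMM_WORLD);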

From: beowulf-boun...@beowulf.org [mailto:beowulf-boun...@beowulf.org] On 
Behalf Of Bruno Coutinho
Sent: 23 May 2009 16:44
To: tri...@vision.ee.ethz.ch
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] MPI - time for packing, unpacking, creating a message...

If you are using Gigabit Ethernet with jumbo frames (9,000 bytes, for example):
A will send three packets of 4,000 bytes each, and
B (2,000 doubles x 8 bytes = 16,000 bytes) will send one packet of 9,000 bytes and one of 7,000 bytes.

For the CPU, B is better because it generates one system call while A generates
three, and since many high-speed interconnects today need large packets to reach
full bandwidth, I think B should be faster.
But the only way to be sure is testing.
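
A minimal timing sketch along these lines (run_variant() is just a placeholder
for the A or B code from the question below, and rank is assumed to come from
MPI_Comm_rank):

int i, iters = 1000;
double t0, t1;

MPI_Barrier(MPI_COMM_WORLD);           /* start all ranks together */
t0 = MPI_Wtime();
for (i = 0; i < iters; i++)
    run_variant();                     /* substitute the A or B pattern here */
MPI_Barrier(MPI_COMM_WORLD);
t1 = MPI_Wtime();
if (rank == 0)
    printf("%g microseconds per iteration\n", 1.0e6 * (t1 - t0) / iters);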


2009/5/18 <tri...@vision.ee.ethz.ch>
Hi all,

Is there anyone who can tell me whether A) or B) is likely to be faster?

A)
process 0 sends 3 x 500 elements (e.g. doubles) to three different processes using
something like:
if (rank == 0) {
    MPI_Send(sendbuf, 500, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    MPI_Send(sendbuf, 500, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD);
    MPI_Send(sendbuf, 500, MPI_DOUBLE, 3, 3, MPI_COMM_WORLD);
} else {
    MPI_Recv(recvbuf, 500, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD, &status);
}


B)
process 0 sends 2000 elements to process 1 using:
if (rank == 0)
    MPI_Send(sendbuf, 2000, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
else
    MPI_Recv(recvbuf, 2000, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD, &status);


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
