Tahir Malas wrote:
-----Original Message-----
From: Vincent Diepeveen [mailto:[EMAIL PROTECTED]
[...]
Quadrics can, for example, act as directly shared memory among all
nodes when you program against its shmem API, which means that for short
messages you can simply place an array in the card's 64MB of on-board RAM
and share that array across all nodes. You just write to this array normally
and the cards take care of keeping it synchronised.

That is how LAM_MPI handles short messages within an SMP node, isn't it? But we
don't change anything in the MPI routines; if the message is short, it is
handled via shared memory automatically.

This comparison is a little bit simplistic. With a shared-memory based MPI, it can also happen that data is transferred with one copy from the send buffer to a shared buffer, then another copy to the receive buffer - if the message is expected. Otherwise, another copy operation to a temporary buffer is required. This never happens with the SHMEM API. Also, MPI adds message headers and does message matching, which is not required for SHMEM.
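To make the difference concrete, here is a minimal sketch of the one-sided model, assuming OpenSHMEM-style calls (shmem_init, shmem_my_pe, shmem_double_put - the original Quadrics shmem names differ slightly): PE 0 writes straight into a symmetric array on PE 1, with no matching receive, no header and no message matching on the target side.

  /* One-sided put into a symmetric array (OpenSHMEM-style names assumed). */
  #include <shmem.h>
  #include <stdio.h>

  static double sym[8];                    /* symmetric: exists on every PE */

  int main(void)
  {
      shmem_init();

      if (shmem_my_pe() == 0) {
          double src[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
          shmem_double_put(sym, src, 8, 1);    /* write directly into PE 1 */
      }
      shmem_barrier_all();                 /* make the data visible */

      if (shmem_my_pe() == 1)
          printf("PE 1 sees sym[0] = %g\n", sym[0]);

      shmem_finalize();
      return 0;
  }

The equivalent two-sided MPI transfer needs a matching MPI_Recv on the other side, which is exactly where the extra copies and the matching overhead come from.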

For IB from Voltaire, for example, there are some options: single port vs.
dual port, MEMfree vs. 128-256 MB RAM on the card (which sounds similar to
Quadrics), and more importantly the bus interface: PCI-X, PCI-E, or AMD
HyperTransport (HTX). HTX is said to provide 1.3 us latency by connecting
directly to the AMD Opteron processor via a standard HyperTransport HTX slot.
Does a HyperTransport HTX slot mean the RAM slots on the motherboard? Do we
then have to sacrifice some slots for the NIC? Well, in the end it is still
unclear to me which one, and how many, to choose.

No, you use a dedicated HTX slot for the NIC. HTX is not found on the majority of Opteron mainboards, but a number of HTX server boards do exist.

From the numbers published by Pathscale, it seems that the simple MPI latency of InfiniPath is about the same whether you go via PCIe or HTX. The application performance might be different, though.

Other than that latency, you have to realize that the latency of those
cards is still ugly compared to the latencies within the quad boxes.

If you have 8 threads or so running in those boxes and you use an IB card,
then it will also see the switch latency.

Only Quadrics is clear about its switch latency (probably the competitors
have a worse one). It's 50 us for 1 card.

Where did you find these numbers? Such a huge delay should be easy to measure with a simple MPI benchmark, e.g. Pathscale's "mpi_multibw".
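For reference, a bare-bones ping-pong along the following lines (just a sketch, not mpi_multibw itself) is already enough to see latencies in that range; run it once with both ranks on the same node and once across the switch, and the difference between the two numbers is the switch contribution.

  /* Minimal MPI ping-pong latency sketch (not mpi_multibw). */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, i, iters = 10000;
      char byte = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (i = 0; i < iters; i++) {
          if (rank == 0) {
              MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double t1 = MPI_Wtime();

      if (rank == 0)   /* half the round-trip is the one-way latency */
          printf("one-way latency: %.2f us\n",
                 (t1 - t0) / iters / 2.0 * 1e6);

      MPI_Finalize();
      return 0;
  }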

But if we directly connect two boxes without a switch, then we can achieve
this latency, I hope?

No, the described latency is a node-internal latency.

 Joachim

--
Joachim Worringen, Software Architect, Dolphin Interconnect Solutions
phone ++49/(0)228/324 08 17 - http://www.dolphinics.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
