Hi Joachim,
Joachim Worringen wrote:
> An offer for "getting a secret white paper on request" is marketing, you
> are right. But at least the SPEC number was technical content - and we
> don't want to analyse every posting sentence-by-sentence, do we?
The SPEC stuff was actually fine. I didn't register it in my brain
because I don't care about compiler stuff, but you are right. Actually,
the white paper was borderline but acceptable; the
I-know-something-but-I-can't-tell-you part was the problem, and I trust
that Greg will be aware of the sensibilities.
> Let me summarize what I consider the key issues:
> - explicit MPI_Irecv/MPI_Send/MPI_Wait, or similar patterns implicitly
> in MPI_Reduce/MPI_Alltoall/MPI_Allreduce with small messages (a few
> doubles, or a few kB), are the dominant communication pattern in many
> MPI applications. There are quite a few (but not as many as one could
> wish) studies that show this.
> - This means it's generally a good thing if the "ping" latency (duration
> of MPI_Send in number of CPU cycles) is as low as possible.
There are two metrics here. The latency is the time it takes for the
message to be received by the other side. The duration of MPI_Send is
actually the send overhead, typically the time it takes to copy the data
from the application buffer to an internal buffer. It makes no sense to
do zero-copy for small messages, unless you patch the kernel to not have
to deal with memory registration (Quadrics can do that) or unless you
use a lightweight kernel that has no virtual memory (Cray and Blue Gene
do that).
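To make the distinction concrete, here is a minimal sketch of how you
can measure both numbers for a small message. It is my own illustration,
not code from MX or from anybody's white paper, and the iteration count
and message size are arbitrary. Run it on exactly two ranks: rank 0
times MPI_Send by itself (the send overhead) and the half round-trip
(the latency), using the MPI_Irecv/MPI_Send/MPI_Wait pattern Joachim
mentions.

/* measure send overhead vs. one-way latency for a small message */
#include <mpi.h>
#include <stdio.h>

#define NITER       10000
#define MSG_DOUBLES 8          /* "a few doubles", i.e. a small message */

int main(int argc, char **argv)
{
    int rank;
    double sbuf[MSG_DOUBLES] = {0}, rbuf[MSG_DOUBLES] = {0};
    double t_send = 0.0, t_rtt = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Request req;
            double t0, t1, t2;

            /* pre-post the receive for the reply */
            MPI_Irecv(rbuf, MSG_DOUBLES, MPI_DOUBLE, 1, 0,
                      MPI_COMM_WORLD, &req);
            t0 = MPI_Wtime();
            MPI_Send(sbuf, MSG_DOUBLES, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            t1 = MPI_Wtime();          /* t1 - t0 = send overhead        */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            t2 = MPI_Wtime();          /* (t2 - t0) / 2 = one-way latency */

            t_send += t1 - t0;
            t_rtt  += t2 - t0;
        } else if (rank == 1) {
            MPI_Recv(rbuf, MSG_DOUBLES, MPI_DOUBLE, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(sbuf, MSG_DOUBLES, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("send overhead: %.2f us, one-way latency: %.2f us\n",
               1e6 * t_send / NITER, 1e6 * t_rtt / (2 * NITER));

    MPI_Finalize();
    return 0;
}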
> - At this message size, CPU utilization or overlapping computation and
> communication is not relevant, as (zero-copy) RDMA does not pay off
> until the message gets to at least some (typically >32, or more) kB in
> size, due to the implied pinning and rendez-vous overhead. Also,
> MPI_Send has no opportunity for overlap, and having a progress thread on
> the receive CPU steal cycles from the application doesn't really help
> either.
Absolutely correct. Overlap is irrelevant for small messages. Progress
can be a problem in extreme cases, though: if you have a lot of
incoming small messages but the application does not consume them or
call MPI, then you will have a flow-control problem and progression is
useful. But this is a pathological case.
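Here is a back-of-the-envelope sketch of why the pinning and rendezvous
cost kills zero-copy for small messages. The bandwidth and overhead
numbers below are made-up illustrative values, not measurements of any
interconnect; plug in your own and the crossover point moves, but the
shape of the argument stays the same: the eager path pays one extra
memcpy, the zero-copy path pays a fixed registration plus handshake
cost, and the wire time is roughly the same either way.

/* compare the extra memcpy of an eager send against the fixed
 * pinning + rendezvous cost of a zero-copy send (illustrative numbers) */
#include <stdio.h>

int main(void)
{
    double copy_bw  = 2e9;    /* assumed memcpy bandwidth, bytes/s      */
    double pin_cost = 15e-6;  /* assumed memory registration cost, s    */
    double rndv_rtt = 8e-6;   /* assumed rendezvous handshake cost, s   */

    for (int kb = 1; kb <= 128; kb *= 2) {
        double bytes     = kb * 1024.0;
        double eager     = bytes / copy_bw;        /* extra copy        */
        double zero_copy = pin_cost + rndv_rtt;    /* fixed overhead    */
        printf("%4d kB: extra copy %6.2f us, zero-copy setup %6.2f us -> %s\n",
               kb, eager * 1e6, zero_copy * 1e6,
               eager < zero_copy ? "copy wins" : "RDMA wins");
    }
    return 0;
}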
> - In these cases, all(?) interconnects do some sort of memcpy() within
> MPI_Send to get rid of the data. The differences are
> * How long does it take to prepare things for the memcpy()? This is
> Greg's message rate.
I don't think it's the time it takes to prepare things. For very small
messages, Greg and I both do PIO, i.e. we copy the data directly to the
NIC. I do that up to 128 bytes, because PIO writes stall your processor
on a slow IO bus. Greg does that for all message sizes on the send side,
from what I hear through the grapevine. From 128 bytes to 32 KB, I do a
pipelined memcpy + DMA, and then zero-copy above 32 KB. I could do PIO
writes up to 4 KB, for example, and that is exactly what I will do for
MX-10G, because PCI-Express will not stall the processor as much as
PCI-X does. It's a tradeoff between PIO writes and memcpy/DMA, and the
parameters are different on HyperTransport or PCI-X/PCI-Express.
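To spell out those send-side paths, here is a toy dispatcher with the
thresholds I just described for the PCI-X case (128 bytes for PIO, 32 KB
for the switch to zero-copy). The three path functions are stand-ins of
my own that only print the chosen strategy; this illustrates the
trade-off, it is not MX source code.

/* toy size-based selection of the send path (illustration only) */
#include <stdio.h>
#include <stddef.h>

#define PIO_MAX   128          /* PIO write to the NIC up to here        */
#define EAGER_MAX (32 * 1024)  /* pipelined memcpy + DMA up to here      */

static void send_pio(size_t len)       { printf("%6zu B: PIO write to NIC\n", len); }
static void send_copy_dma(size_t len)  { printf("%6zu B: pipelined memcpy + DMA\n", len); }
static void send_zero_copy(size_t len) { printf("%6zu B: zero-copy rendezvous\n", len); }

static void send_message(size_t len)
{
    if (len <= PIO_MAX)
        send_pio(len);          /* cheapest for tiny messages, but PIO
                                   writes can stall the CPU on a slow bus */
    else if (len <= EAGER_MAX)
        send_copy_dma(len);     /* copy into a pre-registered buffer and
                                   let the DMA engine push it out         */
    else
        send_zero_copy(len);    /* pin the user buffer, rendezvous, RDMA  */
}

int main(void)
{
    size_t sizes[] = { 64, 128, 1024, 32768, 65536 };
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        send_message(sizes[i]);
    return 0;
}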
> But I don't think that Greg's "Real Application Performance" white paper
> is infamous. It states where the data comes from; you have to trust him
> for his own numbers, and it does not directly link the differences in
> the application performance to the messaging rate. Of course, it does
> not offer a scientific analysis, and you cannot compare it to papers
> like the ones from Leonid Oliker. But I don't think it's unfair, and it
> surely stimulates the competition for better technical solutions or
> better white papers.
White papers are evil by definition. They show what you want to show,
and there is no peer review, so you can say what you want.
It's not fair to use old hardware/software or to use third-party results
that you know nothing about. If you want to do comparisons, get your
hands on your competitors' products and do the testing yourself. We
bought a Quadrics cluster a long time ago to do just that :-) You can
also ask friends to get access to clusters. The web is the last place I
would look to find reliable information.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com