Hi Joachim,

Joachim Worringen wrote:
An offer for "getting a secret white paper on request" is marketing, you are right. But at least the SPEC number was technical content - and we don't want to analyse every posting sentence-by-sentence, do we?

The SPEC stuff was actually fine. I didn't register it in my brain because I don't care about compiler stuff, but you are right. Actually, the white paper was borderline but acceptable; the I-know-something-but-I-can't-tell-you part was the problem, and I trust that Greg will be aware of the sensibilities.

Let me summarize what I consider the key issues:
- explicit MPI_Irecv/MPI_Send/MPI_Wait, or similar patterns implicitely in MPI_Reduce/MPI_Alltoall/MPI_Allreduce with small messages (a few doubles, or a few kB) are the dominant communication pattern in many MPI applications. There are quite some (but not as many as one could wish) studies that show this. - This means it's generally a good thing if the "ping" latency (duration of MPI_Send in number of CPU cycles) is as low as possible.

There are two metrics here. The latency is the time it takes for the message to be received by the other side. The duration of MPI_Send is actually the send overhead, typically the time it takes to copy the data from the application buffer to an internal buffer. It makes no sense to do zero-copy for small messages, unless you patch the kernel to not have to deal with memory registration (Quadrics can do that) or unless you use a lightweight kernel that has no virtual memory (Cray and Blue Gene do that).
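
To make the distinction concrete, here is a minimal ping-pong sketch (not a tuned benchmark) that reports the two numbers separately: the time spent inside MPI_Send (send overhead) and half of the round-trip time (latency). Run it with two ranks; the 64-byte payload and the iteration count are arbitrary choices for illustration.

/* Minimal sketch of the two metrics: send overhead vs. half round-trip
 * latency.  Message size and iteration count are illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char buf[64] = {0};              /* a "few doubles" worth of payload */
    const int iters = 10000;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_send = 0.0, t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            double t1 = MPI_Wtime();
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            t_send += MPI_Wtime() - t1;   /* accumulates sender-side overhead */
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0) {
        double rtt = (MPI_Wtime() - t0) / iters;
        printf("send overhead: %.2f us, half round-trip latency: %.2f us\n",
               1e6 * t_send / iters, 1e6 * rtt / 2.0);
    }
    MPI_Finalize();
    return 0;
}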

- At this message size, CPU utilization or overlapping computation and communication is not relevant, as (zero-copy) RDMA does not pay off until the message reaches a certain size (typically 32 kB or more), due to the implied pinning and rendezvous overhead. Also, MPI_Send has no opportunity for overlap, and having a progress thread on the receive CPU steal cycles from the application doesn't really help either.

Absolutely correct. Overlap is irrelevant for small messages. Progress can be a problem in extreme cases, though: if you have a lot of incoming small messages but the application does not consume them or call into MPI, then you will have a flow-control problem and progression is useful. But this is a pathological case.
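
To illustrate that pathological case, here is a hedged sketch of one common workaround: the compute loop occasionally calls into MPI (MPI_Iprobe here) so a library without a progress thread can drain incoming small messages. do_some_compute() and the polling interval are made-up placeholders, not anything from a real application.

/* Hedged sketch: a long compute phase that polls MPI now and then so
 * queued small messages can make progress and flow control is not hit. */
#include <mpi.h>

void do_some_compute(void);   /* hypothetical application kernel */

void compute_loop_with_progress(void)
{
    MPI_Status status;
    int flag;

    for (int step = 0; step < 1000000; step++) {
        do_some_compute();
        if (step % 1000 == 0) {          /* arbitrary polling interval */
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, &status);
            /* flag/status are not used here; the call itself gives the
             * library a chance to progress queued small messages. */
        }
    }
}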

- In these cases, all(?) interconnects do some sort of memcpy() within MPI_Send to get rid of the data. The differences are:
  * How long does it take to prepare things for the memcpy()? This is Greg's message rate.

I don't think it's the time it takes to prepare things. For very small messages, Greg and I both do PIO, i.e. we copy the data directly to the NIC. I do that up to 128 bytes, because PIO writes stall your processor on a slow I/O bus. Greg does that for all message sizes on the send side, from what I hear through the grapevine. From 128 bytes to 32 KB, I do pipelined memcpy + DMA, and then zero-copy above 32 KB. I could do PIO writes up to 4 KB, for example, and this is exactly what I will do for MX-10G because PCI-Express will not stall the processor as much as PCI-X does. It's a tradeoff between PIO writes and memcpy/DMA, and the parameters are different on HyperTransport versus PCI-X/PCI-Express.
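
As a rough sketch, that size-based switch looks something like the following (thresholds as above; pio_write(), copy_and_dma(), and rdma_zero_copy() are hypothetical placeholders, not actual MX or driver entry points).

/* Rough sketch of a size-based send protocol switch using the 128-byte
 * and 32 KB thresholds mentioned above.  All callees are placeholders. */
#include <stddef.h>

#define PIO_THRESHOLD   128            /* copy straight to the NIC below this */
#define RNDV_THRESHOLD  (32 * 1024)    /* register + zero-copy RDMA above this */

void pio_write(const void *buf, size_t len);       /* hypothetical */
void copy_and_dma(const void *buf, size_t len);    /* hypothetical */
void rdma_zero_copy(const void *buf, size_t len);  /* hypothetical */

void send_by_size(const void *buf, size_t len)
{
    if (len <= PIO_THRESHOLD) {
        /* Programmed I/O: the CPU writes the payload directly to the NIC;
         * cheapest per message, but stalls the CPU on a slow I/O bus. */
        pio_write(buf, len);
    } else if (len < RNDV_THRESHOLD) {
        /* Pipelined memcpy into a pre-registered buffer, then DMA. */
        copy_and_dma(buf, len);
    } else {
        /* Zero-copy rendezvous: the pinning/registration cost is
         * amortized over a large transfer. */
        rdma_zero_copy(buf, len);
    }
}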

But I don't think that Greg's "Real Application Performance" white paper is infamous. It states where the data comes from, you have to trust him for his own numbers, and it does not directly link the differences in application performance to the messaging rate. Of course, it does not offer a scientific analysis, and you cannot compare it to papers like the ones from Leonid Oliker. But I don't think it's unfair, and it surely stimulates the competition for better technical solutions or better white papers.

White papers are evil by definition. They show what you want to show, and there is no peer review, so you can say what you want.

It's not fair to use old hardware/software or to use third-party results that you know nothing about. If you want to do a comparison, get your hands on your competitors' products and do the testing yourself. We bought a Quadrics cluster a long time ago to do just that :-) You can also ask friends for access to their clusters. The web is the last place I would look for reliable information.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com