Gilad,

Gilad Shainer wrote:
> So now we can discuss technical terms and not marketing terms such as
> price/performance. InfiniBand uses 10Gb/s and 20Gb/s link signaling
> rates. The coding of the data onto the link is 8/10. When someone refers
> to 10 and 20Gb/s, it is the link speed and there is nothing confusing
> here - this is the InfiniBand specification (and a standard, if I may say).

You don't have a 1.25 Gb/s Ethernet port on your laptop, do you? The Gigabit Ethernet signal rate is 1.25 Gb/s per the standard, but the data rate is 1 Gb/s after 8b/10b encoding, and that's why everybody calls it Gigabit Ethernet. Same thing with 10 GigE (12.5 Gb/s signal rate) and with Myrinet 2G (2.5 Gb/s signal rate) or 10G (12.5 Gb/s signal rate), all according to their respective standards.

Similarly, my laptop has 1 GB of memory, not 1.125 GB of parity memory.

By using signaling rate instead of data rate, you are going against all conventions in networking. There is no technical basis for that choice, except that 10 is bigger than 8.
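
For the record, the arithmetic is trivial. Here is a quick back-of-envelope sketch in C, using the signal rates cited above and nothing more than the 8/10 coding factor, so take the exact figures as illustrative:

/* Signal rate vs. data rate after 8b/10b line coding.  Signal rates are
 * the ones cited in this thread; this is illustration, not a spec quote. */
#include <stdio.h>

int main(void)
{
    struct { const char *link; double signal_gbps; } links[] = {
        { "Gigabit Ethernet", 1.25 },
        { "10 GigE",          12.5 },
        { "Myrinet 2G",       2.5  },
        { "Myri-10G",         12.5 },
        { "IB SDR 4x",        10.0 },
        { "IB DDR 4x",        20.0 },
    };
    const double coding = 8.0 / 10.0;   /* 8b/10b */

    for (unsigned i = 0; i < sizeof links / sizeof links[0]; i++)
        printf("%-18s signal %5.2f Gb/s -> data %5.2f Gb/s\n",
               links[i].link, links[i].signal_gbps,
               links[i].signal_gbps * coding);
    return 0;
}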

> The PCIe specification is exactly the same. Same link speed and same 8/10
> data encoding. When you say 13.7Gb/s, you confuse the specification with
> the MTU (data size) that some of the chipsets support.

I just pointed out that your claim that bigger pipes yield better application performance has to be adjusted for the effective throughput available to the application. IB SDR actually has less effective bandwidth than 10 GigE, and IB DDR cannot go more than about 1.3x faster today.
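
To put numbers on that, here is a sketch. The 13.7 Gb/s value is the PCIe x8 limit with 128-Byte Read Completions (more on where it comes from below); the exact value depends on the chipset:

/* Sketch of the comparison: 8b/10b data rates vs. the PCIe-limited
 * throughput discussed below.  Illustration only. */
#include <stdio.h>

int main(void)
{
    const double ib_sdr_data   = 10.0 * 8.0 / 10.0;  /*  8 Gb/s after 8b/10b */
    const double ib_ddr_data   = 20.0 * 8.0 / 10.0;  /* 16 Gb/s after 8b/10b */
    const double tengige_data  = 10.0;                /* 10 Gb/s data rate    */
    const double pcie_x8_limit = 13.7;  /* PCIe x8, 128B Read Completions     */

    /* What the host can actually move is min(link data rate, PCIe limit). */
    double ddr_effective = ib_ddr_data < pcie_x8_limit ? ib_ddr_data
                                                       : pcie_x8_limit;

    printf("IB SDR effective : %4.1f Gb/s  (10 GigE: %4.1f Gb/s)\n",
           ib_sdr_data, tengige_data);
    printf("IB DDR effective : %4.1f Gb/s  -> %.2fx of 10 GigE\n",
           ddr_effective, ddr_effective / tengige_data);
    return 0;
}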

> For chipsets that support MTU > 128B, your calculation is wrong and the
> data throughput is higher.

Please, name one PCI Express chipset that implements Read Completions larger than 128 Bytes. Do not confuse them with the Max Payload Size, which applies to Write operations. Read Completions are as large as the transaction size on the memory bus (and that makes a lot of sense if you think about it). Intel chipsets can do PCIe combining to reach 128 Bytes, and they could in theory combine into a larger buffer, but nobody does today.
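
For completeness, here is roughly where the 13.7 Gb/s number comes from. The per-TLP overhead below is an approximation (3-DW completion header plus framing, sequence number and LCRC); flow-control and ACK DLLPs shave off a bit more in practice, which is why the figure lands slightly under 14 Gb/s:

/* Sketch: why 128-Byte Read Completions cap PCIe x8 (Gen1) read
 * throughput around 13-14 Gb/s.  The 20-byte overhead per TLP is
 * approximate; DLLP traffic is not counted here. */
#include <stdio.h>

int main(void)
{
    double lanes        = 8.0;
    double signal_gbps  = 2.5;                              /* per lane, Gen1 */
    double data_gbps    = lanes * signal_gbps * 8.0 / 10.0; /* 16 Gb/s        */

    double payload      = 128.0;    /* Read Completion payload               */
    double tlp_overhead = 20.0;     /* approx. header + framing + seq + LCRC */
    double efficiency   = payload / (payload + tlp_overhead);

    printf("PCIe x8 Gen1 data rate : %5.2f Gb/s\n", data_gbps);
    printf("128B RC efficiency     : %5.3f\n", efficiency);
    printf("Effective read BW      : %5.2f Gb/s\n", data_gbps * efficiency);
    return 0;
}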

Once PCIe 2.0 doubles the bandwidth, you will be able to say that IB at 16 Gb/s is twice as fast as IB at 8 Gb/s for applications, but not today.

> What is also interesting to know is that when one uses InfiniBand 20Gb/s,
> he/she can fully utilize the PCIe x8 link, while in your case the Myricom
> I/O interface is the bottleneck.

If you have a look at the following web page, you will see the effective bandwidth supported by a large variety of PCI Express chipsets:
http://www.myri.com/scs/performance/PCIe_motherboards/

This is a pure PCI Express DMA measurement, no network involved. You will see that some chipsets do not even sustain 10 Gb/s in the Read direction, and most just barely sustain 20 Gb/s bidirectional. On many motherboards, 10G *does* saturate the PCIe x8 link.

> saw from 3 non-biased parties. In all the application benchmarks, Myrinet
> 2G shows poor performance compared to 10 and 20Gb/s. As for the
> registration cache comment, I would go back to the "famous" RDMA paper
> and the proper responses from IBM and others. The answer to this comment
> is fully described in those responses.

I strongly advise you to read the related posts by Christian at Qlogic and learn from them. He gets it, you don't. The IBM guy has never programmed RDMA interconnects or dealt with memory registration (I am not sure he has ever programmed anything), and apparently neither have you.

Here, try this simple benchmark: a pingpong that increments the send and receive buffer addresses at each iteration, something like the sketch below. Tell me if IB beats GigE for, say, 64KB messages.
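
A minimal sketch of what I mean (two ranks, plain MPI_Send/MPI_Recv, no warm-up or timing refinements; the 64-byte shift per iteration is just an arbitrary way to present a new buffer address every time, so a registration cache never gets a hit):

/* Pingpong with shifting buffers: the active window slides through a
 * larger slab, so both sides touch a new address every iteration. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG   (64 * 1024)       /* 64KB messages            */
#define ITERS 1000
#define STEP  64                /* shift buffers each round */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *slab = malloc(MSG + ITERS * STEP);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        char *buf = slab + i * STEP;   /* new address every iteration */
        if (rank == 0) {
            MPI_Send(buf, MSG, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("avg round-trip: %.1f us\n",
               (MPI_Wtime() - t0) * 1e6 / ITERS);

    free(slab);
    MPI_Finalize();
    return 0;
}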

Similarly, on many applications I have checked, Qlogic IB SDR has better performance than Mellanox IB DDR, despite having a smaller pipe (and despite Mellanox claiming the contrary).


> Are you selling Myricom HW or Qlogic HW?

I don't do hardware or sales, believe it or not; I am a software guy. You tell me that a bigger pipe is better, and I reply by example that it's not about the size of the pipe, it's how you use it. Qlogic has a smaller one, but they don't mind.

> and not only on pure latency or pure bandwidth. Qlogic until recently (*)
> had the lowest latency number, but when it comes to applications, the CPU
> overhead is too high. Check some of the papers at Cluster to see the
> application results.

You really do not understand MPI implementations. Qlogic's send and recv overhead is a problem for large messages, but small and medium messages are much more important for MPI applications. For these message sizes, it is actually faster to copy on both sides into pre-registered buffers than to do a rendezvous for zero-copy (see the toy cost model below). What is the difference between PIO on the send side and copy on the receive side (what Qlogic does), versus copy on both sides?
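
If you want to see the trade-off, here is a toy cost model. Every constant in it is an assumption I am making up for illustration (copy bandwidth, pin cost per page, handshake round-trip), not a measurement of any particular NIC or MPI, but the shape of the result is the point: with these numbers the two copies stay cheaper than registration plus handshake up to a few tens of KB.

/* Toy cost model: eager copy through pre-registered buffers vs.
 * rendezvous zero-copy.  All constants are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double copy_gbps   = 20.0;   /* assumed memcpy bandwidth, per side   */
    double wire_gbps   = 8.0;    /* assumed link data rate               */
    double reg_us_page = 1.0;    /* assumed cost to pin one 4KB page     */
    double rtt_us      = 5.0;    /* assumed rendezvous handshake RTT     */

    for (int kb = 1; kb <= 1024; kb *= 4) {
        double bytes = kb * 1024.0;
        double wire  = bytes * 8 / (wire_gbps * 1000.0);           /* us */
        double eager = wire + 2 * bytes * 8 / (copy_gbps * 1000.0);
        double rndv  = wire + rtt_us + 2 * reg_us_page * (bytes / 4096.0);
        printf("%5d KB  eager %7.1f us   rendezvous %7.1f us\n",
               kb, eager, rndv);
    }
    return 0;
}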

The Qlogic design does cut corners for large messages, but the tradeoff is that it keeps the design simple (thus easier to implement) without affecting application performance too much.

Don't get me wrong, Qlogic is my competitor too, and sometimes I savagely want to cut Greg's hair when he is wrong, but they mostly (and Quadrics definitely) know what they are doing.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

Reply via email to