Gilad,

Gilad Shainer wrote:
> So now we can discuss technical terms and not marketing terms such as
> price/performance. InfiniBand uses 10Gb/s and 20Gb/s link signaling
> rates. The coding of the data onto the link is 8/10. When someone refers
> to 10 and 20Gb/s, it is the link speed and there is nothing confusing
> here - this is the InfiniBand specification (and a standard, if I may say).

You don't have a 1.25 Gb/s Ethernet port on your laptop, do you? The Gigabit Ethernet signal rate is 1.25 Gb/s per the standard, but the data rate is 1 Gb/s after 8b/10b encoding, and that's why everybody calls it Gigabit Ethernet. Same thing with 10 GigE (12.5 Gb/s signal rate) and with Myrinet 2G (2.5 Gb/s signal rate) or 10G (12.5 Gb/s signal rate), all according to their respective standards.

Similarly, my laptop has 1 GB of memory, not 1.125 GB of parity memory.

By using signaling rate instead of data rate, you are going against all conventions in networking. There is no technical basis for that choice, except that 10 is bigger than 8.
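
For the record, the arithmetic is trivial. Here is a quick back-of-envelope sketch in C, using the signal rates cited above and nothing more than the 8/10 coding factor, so take the exact figures as illustrative:

/* Signal rate vs. data rate after 8b/10b line coding.  Signal rates are
 * the ones cited in this thread; this is illustration, not a spec quote. */
#include <stdio.h>

int main(void)
{
    struct { const char *link; double signal_gbps; } links[] = {
        { "Gigabit Ethernet", 1.25 },
        { "10 GigE",          12.5 },
        { "Myrinet 2G",       2.5  },
        { "Myri-10G",         12.5 },
        { "IB SDR 4x",        10.0 },
        { "IB DDR 4x",        20.0 },
    };
    const double coding = 8.0 / 10.0;   /* 8b/10b */

    for (unsigned i = 0; i < sizeof links / sizeof links[0]; i++)
        printf("%-18s signal %5.2f Gb/s -> data %5.2f Gb/s\n",
               links[i].link, links[i].signal_gbps,
               links[i].signal_gbps * coding);
    return 0;
}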

> The PCIe specification is exactly the same. Same link speed and same 8/10
> data encoding. When you say 13.7Gb/s, you confuse the specification with
> the MTU (data size) that some of the chipsets support.

I just pointed out that your claim that bigger pipes yield better application performance has to be adjusted for the effective throughput available to the application. IB SDR actually has less effective bandwidth than 10 GigE, and IB DDR cannot go more than about 1.3x faster today.
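
To put numbers on that, here is a sketch. The 13.7 Gb/s value is the PCIe x8 limit with 128-Byte Read Completions (more on where it comes from below); the exact value depends on the chipset:

/* Sketch of the comparison: 8b/10b data rates vs. the PCIe-limited
 * throughput discussed below.  Illustration only. */
#include <stdio.h>

int main(void)
{
    const double ib_sdr_data   = 10.0 * 8.0 / 10.0;  /*  8 Gb/s after 8b/10b */
    const double ib_ddr_data   = 20.0 * 8.0 / 10.0;  /* 16 Gb/s after 8b/10b */
    const double tengige_data  = 10.0;                /* 10 Gb/s data rate    */
    const double pcie_x8_limit = 13.7;  /* PCIe x8, 128B Read Completions     */

    /* What the host can actually move is min(link data rate, PCIe limit). */
    double ddr_effective = ib_ddr_data < pcie_x8_limit ? ib_ddr_data
                                                       : pcie_x8_limit;

    printf("IB SDR effective : %4.1f Gb/s  (10 GigE: %4.1f Gb/s)\n",
           ib_sdr_data, tengige_data);
    printf("IB DDR effective : %4.1f Gb/s  -> %.2fx of 10 GigE\n",
           ddr_effective, ddr_effective / tengige_data);
    return 0;
}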

> For chipsets that support MTU > 128B, your calculation is wrong and the
> data throughput is higher.

Please, name one PCI Express chipset that implements Read Completions larger than 128 Bytes. Do not confuse them with the Max Payload Size, which applies to Write operations. Read Completions are as large as the transaction size on the memory bus (and that makes a lot of sense if you think about it). Intel chipsets can do PCIe combining to reach 128 Bytes, and they could in theory combine into a larger buffer, but nobody does today.
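
For completeness, here is roughly where the 13.7 Gb/s number comes from. The per-TLP overhead below is an approximation (3-DW completion header plus framing, sequence number and LCRC); flow-control and ACK DLLPs shave off a bit more in practice, which is why the figure lands slightly under 14 Gb/s:

/* Sketch: why 128-Byte Read Completions cap PCIe x8 (Gen1) read
 * throughput around 13-14 Gb/s.  The 20-byte overhead per TLP is
 * approximate; DLLP traffic is not counted here. */
#include <stdio.h>

int main(void)
{
    double lanes        = 8.0;
    double signal_gbps  = 2.5;                              /* per lane, Gen1 */
    double data_gbps    = lanes * signal_gbps * 8.0 / 10.0; /* 16 Gb/s        */

    double payload      = 128.0;    /* Read Completion payload               */
    double tlp_overhead = 20.0;     /* approx. header + framing + seq + LCRC */
    double efficiency   = payload / (payload + tlp_overhead);

    printf("PCIe x8 Gen1 data rate : %5.2f Gb/s\n", data_gbps);
    printf("128B RC efficiency     : %5.3f\n", efficiency);
    printf("Effective read BW      : %5.2f Gb/s\n", data_gbps * efficiency);
    return 0;
}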

Once PCIe 2.0 doubles the bandwidth, you will be able to say that IB at 16 Gb/s is twice as fast as IB at 8 Gb/s for applications, but not today.

> What is also interesting to know is that when one uses InfiniBand 20Gb/s,
> he/she can fully utilize the PCIe x8 link, while in your case the Myricom
> I/O interface is the bottleneck.

If you have a look at the following web page, you will see the effective bandwidth supported by a large variety of PCI Express chipsets:
http://www.myri.com/scs/performance/PCIe_motherboards/

This is a pure PCI Express DMA measurement, no network involved. You will see that some chipsets do not even sustain 10 Gb/s in the Read direction, and most just barely sustain 20 Gb/s bidirectional. On many motherboards, 10G *does* saturate the PCIe x8 link.

> saw from 3 non-biased parties. In all the application benchmarks, Myrinet
> 2G shows poor performance compared to 10 and 20Gb/s. As for the
> registration cache comment, I would go back to the "famous" RDMA paper
> and the proper responses from IBM and others. The answer to this comment
> is fully described in those responses.

I strongly advise you to read the related posts by Christian at Qlogic and learn from them. He gets it, you don't. The IBM guy has never programmed RDMA interconnects or dealt with memory registration (I am not sure he has ever programmed anything), and apparently neither have you.

Here, try this simple benchmark: a pingpong that increments the send and receive buffer addresses at each iteration, something like the sketch below. Tell me if IB beats GigE for, say, 64KB messages.
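
A minimal sketch of what I mean (two ranks, plain MPI_Send/MPI_Recv, no warm-up or timing refinements; the 64-byte shift per iteration is just an arbitrary way to present a new buffer address every time, so a registration cache never gets a hit):

/* Pingpong with shifting buffers: the active window slides through a
 * larger slab, so both sides touch a new address every iteration. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG   (64 * 1024)       /* 64KB messages            */
#define ITERS 1000
#define STEP  64                /* shift buffers each round */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *slab = malloc(MSG + ITERS * STEP);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        char *buf = slab + i * STEP;   /* new address every iteration */
        if (rank == 0) {
            MPI_Send(buf, MSG, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("avg round-trip: %.1f us\n",
               (MPI_Wtime() - t0) * 1e6 / ITERS);

    free(slab);
    MPI_Finalize();
    return 0;
}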

Similarly, on many applications I have checked, Qlogic IB SDR has better performance than Mellanox IB DDR, despite having a smaller pipe (and despite Mellanox claiming the contrary).


> Are you selling Myricom HW or Qlogic HW?

I don't do hardware or sales, believe it or not; I am a software guy. You tell me that a bigger pipe is better, and I reply by example that it's not about the size of the pipe, it's how you use it. Qlogic has a smaller one, but they don't mind.

> and not only on pure latency or pure bandwidth. Qlogic until recently (*)
> had the lowest latency number, but when it comes to applications, the CPU
> overhead is too high. Check some of the papers at Cluster to see the
> application results.

You really do not understand MPI implementations. Qlogic's send and recv overhead is a problem for large messages, but small and medium messages are much more important for MPI applications. For these message sizes, it is actually faster to copy on both sides into pre-registered buffers than to do a rendezvous for zero-copy (see the toy cost model below). What is the difference between PIO on the send side and copy on the receive side (what Qlogic does), versus copy on both sides?
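
If you want to see the trade-off, here is a toy cost model. Every constant in it is an assumption I am making up for illustration (copy bandwidth, pin cost per page, handshake round-trip), not a measurement of any particular NIC or MPI, but the shape of the result is the point: with these numbers the two copies stay cheaper than registration plus handshake up to a few tens of KB.

/* Toy cost model: eager copy through pre-registered buffers vs.
 * rendezvous zero-copy.  All constants are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double copy_gbps   = 20.0;   /* assumed memcpy bandwidth, per side   */
    double wire_gbps   = 8.0;    /* assumed link data rate               */
    double reg_us_page = 1.0;    /* assumed cost to pin one 4KB page     */
    double rtt_us      = 5.0;    /* assumed rendezvous handshake RTT     */

    for (int kb = 1; kb <= 1024; kb *= 4) {
        double bytes = kb * 1024.0;
        double wire  = bytes * 8 / (wire_gbps * 1000.0);           /* us */
        double eager = wire + 2 * bytes * 8 / (copy_gbps * 1000.0);
        double rndv  = wire + rtt_us + 2 * reg_us_page * (bytes / 4096.0);
        printf("%5d KB  eager %7.1f us   rendezvous %7.1f us\n",
               kb, eager, rndv);
    }
    return 0;
}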

The Qlogic design does cut corners for large messages, but the tradeoff is that it keeps the design simple (thus easier to implement) without affecting application performance too much.

Don't get me wrong, Qlogic is my competitor too, and sometimes I savagely want to cut Greg's hair when he is wrong, but they mostly (and Quadrics definitely) know what they are doing.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

Reply via email to