On Tue, 2007-04-24 at 08:55, Ashley Pittman wrote:
> On Sat, 2007-04-21 at 13:16 +0200, Håkon Bugge wrote:
> > PIO is a term with two different
> > interpretations. For a shared address space NIC,
> > such as Dolphin's SCI adapters, PIO implies a
> > sender CPU writing data directly into the user
> > space of a remote process on a remote node. The
> > cluster interconnect emulates a PCI to PCI bridge
> > in this case. On other NICs, PIO implies using
> > the processor to transmit the DMA descriptor and
> > the data to the local NIC. Then the local NIC
> > issues a DMA to transmit the data/message to the
> > remote node from a local buffer on the NIC. The
> > main point is the local NIC doesn't have to issue
> > a DMA read to local memory in order to read the DMA descriptor and data.
>
> That would explain why QLogic use PIO for up to 64k messages and we
> switch to DMA at only a few hundred. For small messages you could best
> describe what we use as a hybrid of the above descriptions: we write
> a network packet across the PCI bus and don't DMA at all.
>
> The downside to PIO of course is you need a CPU to drive it, so besides
> the fact it's slow you can't do anything asynchronously.
>
> > So, when Mellanox reduces the latency from around
> > 4 to around 1 usec, I assume they have modified
> > the hardware-software interface of their HCA to
> > enable PIO mode send operations, where DMA
> > descriptor+data is transmitted on the PCI(e) bus
> > using a single WC bus tenure. I haven't used a
> > PCI analyzer on their HCAs, but a rule of thumb
> > is that every I/O operation to a NIC takes on the
> > order of 1 usec. So maybe they have managed to go
> > from 3 to one I/O operation in order to kick off
> > a transfer. Pure speculation from my side though.
>
> That's an interesting theory, but I suspect your numbers are a little
> out. My own measurements put a PIO word write in the region of .15 uSec
> depending on chipset.
> Of course if you are right then the remaining PIO
> write is happening in 1 uSec, which leaves only .2 uSec for the network,
> which seems a little fast to me.
>
> Regardless of how they have done it, 1.2 is impressive; what would make
> me even more impressed is if it was quoted as 1.20, which would, as far as
> I'm aware, mean that they had the lowest latency of anybody.
This is true if the 1.2 number is quoted through a switch, but as I
understand it Mellanox quotes back-to-back numbers as their latency
numbers. I have measured QLogic HTX adapters within 50 ns of 1.0 usec
when going back to back, but no one I'm aware of actually uses IB that
way; everyone wants to run in a cluster with more than 2 nodes using a
switch, so that's how we quote our latency.

Disclosure: in case it's not clear from the above, I do work at QLogic,
but anyone with our HT cards can reproduce the above for themselves.

-Kevin

>
> Ashley,
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
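
For anyone following the arithmetic in the quoted exchange, here is a rough Python sketch of the latency budget being argued about. It only restates the figures from the thread (roughly 1 usec per I/O operation as the rule of thumb, roughly .15 usec for a measured PIO word write, and the 1.2 usec latency under discussion). The constant and function names are made up for illustration, and nothing here is a measurement.

```python
# Back-of-envelope latency budget using the rough figures quoted in the thread.
# Names are hypothetical; the numbers are the posters' estimates, not measurements.

IO_OP_RULE_OF_THUMB_US = 1.0  # rule of thumb: ~1 usec per I/O operation to a NIC
PIO_WORD_WRITE_US = 0.15      # reported PIO word write cost (chipset dependent)
QUOTED_LATENCY_US = 1.2       # the end-to-end latency figure under discussion


def remaining_budget_us(total_us, io_ops, cost_per_op_us):
    """Time left for everything else (wire, switch, receive side)
    after paying for io_ops host-to-NIC I/O operations."""
    return total_us - io_ops * cost_per_op_us


if __name__ == "__main__":
    cases = (
        ("1 usec per I/O op (rule of thumb)", IO_OP_RULE_OF_THUMB_US),
        (".15 usec per PIO write (reported)", PIO_WORD_WRITE_US),
    )
    for label, cost in cases:
        left = remaining_budget_us(QUOTED_LATENCY_US, 1, cost)
        print(f"{label}: {left:.2f} usec left for the network")
```

With one I/O operation at the rule-of-thumb cost, only about .2 usec is left for the network, which is the objection raised in the quote; at the reported PIO write cost the remainder is closer to 1 usec.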