Hey Larry,

Larry Stewart wrote:
Does anyone know, or know where to find out, how long it takes to do a store to a device register on a Nehalem system with a PCI Express device?
Are you asking about latency or throughput? For latency, it depends on the distance between the core and the IOH (each QuickPath hop can cost ~100 ns, if I remember correctly) and on whether there are PCIe switches in front of the device. Throughput is limited by the PCIe bandwidth (~75% of the link rate in practice), but you can reach that limit with 64-byte writes.
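To put a number on the throughput side, here is a minimal sketch of the arithmetic. The function name and its parameters are made up for illustration, and the 0.75 efficiency is only the rule of thumb above, not a measurement:

```c
#include <assert.h>

/* Rough estimate of achievable PCIe write throughput.
 * link_rate_mb_s: data rate of the link in MB/s, supplied by the caller.
 * efficiency:     fraction of the link rate left after protocol overhead;
 *                 ~0.75 is the rule of thumb quoted above (an assumption). */
double pcie_effective_mb_s(double link_rate_mb_s, double efficiency)
{
    assert(link_rate_mb_s >= 0.0);
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return link_rate_mb_s * efficiency;
}
```

For example, with an assumed link rate of 4000 MB/s per direction, pcie_effective_mb_s(4000.0, 0.75) gives 3000 MB/s.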
Also, does write combining work with such a setup?
Sure, write-combining works on all Intel CPUs since the Pentium III. It only bursts at 64 bytes though; anything else is fragmented into 8-byte writes. AMD chips flush WC buffers at 16, 32 and 64 bytes.
And don't assume that enabling WC guarantees you only get 64-byte writes. Sometimes, especially when there is an interrupt, the WC buffer is flushed early. And don't assume any ordering between the resulting 8-byte writes either.
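A minimal sketch of how one might post a 64-byte message through a WC mapping, assuming GCC or Clang. The names wc_post_64 and store_fence are hypothetical, and dst stands in for a WC-mapped device window (how you obtain that mapping is outside this sketch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WC_LINE_BYTES 64

/* Store fence. On x86 this is SFENCE, which also drains the WC buffers.
 * On other targets, fall back to a release fence so the sketch still builds. */
static inline void store_fence(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("sfence" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_RELEASE);
#endif
}

/* Copy one 64-byte message into a 64-byte-aligned slot using eight 8-byte
 * stores, then fence so the data is pushed out now rather than whenever the
 * CPU decides to evict the WC buffer. */
void wc_post_64(volatile uint64_t *dst, const uint64_t src[8])
{
    assert(((uintptr_t)dst % WC_LINE_BYTES) == 0);
    for (size_t i = 0; i < WC_LINE_BYTES / sizeof(uint64_t); i++)
        dst[i] = src[i];
    store_fence();
}
```

Filling a whole aligned 64-byte slot gives the CPU the chance to emit a single burst, but per the caveats above the device side should still tolerate the message arriving as several 8-byte writes in any order.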
I recall that the QLogic Infinipath uses such features to get good short message performance, but my memory of it is pre-Nehalem.
Nehalem just adds NUMA overhead, and a lot more memory bandwidth.
Question 2 - if the device stores to an address on which a core is spinning, how long does it take for the core to return the new value?
On NUMA, it depends on whether the device writes to memory on the socket where you are spinning. If it's the same socket, the cache line is invalidated immediately. If you busy-poll on a different socket, cache coherency traffic between sockets gets involved and it's more expensive. On the local socket, I would say between 150 and 250 ns, assuming no PCIe switch in between (those can cost 150 ns more).
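For reference, the kind of spin loop I have in mind looks roughly like this. It is a sketch assuming GCC or Clang; spin_until_changed is a made-up name, and flag is assumed to be a word in host memory that the device updates by DMA:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Spin until *flag differs from old_value, giving up after at most
 * max_spins re-reads. Returns the last value read. */
uint64_t spin_until_changed(const volatile uint64_t *flag,
                            uint64_t old_value, uint64_t max_spins)
{
    assert(flag != NULL);
    uint64_t v = *flag;
    while (v == old_value && max_spins-- > 0) {
#if defined(__x86_64__) || defined(__i386__)
        /* Spin-wait hint; eases pressure on a sibling hyperthread. */
        __asm__ __volatile__("pause");
#endif
        v = *flag;
    }
    return v;
}
```

While the line stays in the local cache the re-reads are cheap; the latency estimated above is the time from the device's write to the first re-read that misses and picks up the new value.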
Patrick