On Wed, 25 Apr 2007, Ashley Pittman wrote:

> You'd have thought that to be the case, but PIO bandwidth is not a
> patch on DMA bandwidth.  On Alphas you used to get a performance
> improvement by evicting the data from the cache immediately after you
> had submitted the DMA, but this doesn't buy you anything with modern
> machines.
Not a patch, but the main goal in many of our cases is minimizing the
amount of time spent in MPI, where the conventional wisdom about offload
doesn't unconditionally apply (and yes, this is contrary to where I
think programming models should be headed).

> > Just to make sure we compare the same thing: the .15usec is the time
> > from the CPU issuing the store instruction until the side effect is
> > visible in the HCA?  In other words, assume a CSR word read takes
> > 0.5usec, so a loop writing and reading the same CSR takes 0.65usec,
> > right?  If that is the case, CSR accesses have improved radically in
> > the last few years.
>
> How long until it's visible is another question.  Typically we write a
> number of values and then flush them; how soon the NIC can see the
> data before the flush is almost entirely chipset dependent.  As you
> say, reads are very bad and we avoid them wherever possible.

I recently measured that it takes InfiniPath 0.165usec to do a complete
MPI_Isend -- so in essence this is 0.165usec of software overhead, which
also includes the (albeit cheap Opteron) store fence.  I don't think
that queueing a DMA request is much different in terms of software
overhead.

For small messages, I suspect that most of the differences will be in
the amount of time the request (PIO or DMA) remains queued in the NIC
before it can be put on the wire.  If issuing a DMA request implies more
work for the NIC than a PIO that requires no DMA reads, this will show
up in the resulting message gap (and get worse as more sends are put in
flight).

In this regard, we have a pretty useful test in GASNet called testqueue
that measures the effect on message gap as the number of outstanding
sends is increased.  Interconnects vary in performance -- QLogic's PIO
and Quadrics's STEN have a fairly flat profile, whereas Mellanox/VAPI
was not so flat after 2 messages in flight, and my Myrinet results are
from very old hardware.  Obviously, I'd encourage everyone to run their
own tests, as various HCA revisions will have their own profiles.

I should come up with this test in an MPI form -- GASNet shows these
metrics for the lower-level software that is used in many MPI
implementations, so comparing the MPI metrics to the GASNet metrics
could help identify overheads in the MPI implementations themselves.

. . christian

--
[EMAIL PROTECTED] (QLogic SIG, formerly Pathscale)
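P.S. Since I mention an MPI form of testqueue above, here is a rough
sketch of what such a test might look like.  This is only my own
illustration of the idea, not GASNet's testqueue; the message size,
iteration count, and queue depths are arbitrary choices.  Two ranks
post an increasing number of small MPI_Isend/MPI_Irecv operations back
to back and report the amortized per-message cost at each depth, which
approximates how the message gap behaves as more sends are in flight.

/*
 * Sketch of an "MPI testqueue": amortized per-message send cost as the
 * number of outstanding MPI_Isend requests grows.  Because the timing
 * includes MPI_Waitall, this measures per-message throughput at each
 * depth rather than pure injection overhead, which is close enough to
 * show whether the gap profile is flat.  Run with exactly 2 ranks.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES 8        /* small, PIO-friendly message */
#define ITERS     1000     /* timed repetitions per depth */
#define MAX_DEPTH 32       /* maximum sends in flight */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char *buf = malloc(MSG_BYTES * MAX_DEPTH);
    MPI_Request req[MAX_DEPTH];

    for (int depth = 1; depth <= MAX_DEPTH; depth *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int it = 0; it < ITERS; it++) {
            if (rank == 0) {
                /* post 'depth' sends back to back, then drain them */
                for (int i = 0; i < depth; i++)
                    MPI_Isend(buf + i * MSG_BYTES, MSG_BYTES, MPI_CHAR,
                              1, 0, MPI_COMM_WORLD, &req[i]);
                MPI_Waitall(depth, req, MPI_STATUSES_IGNORE);
            } else {
                for (int i = 0; i < depth; i++)
                    MPI_Irecv(buf + i * MSG_BYTES, MSG_BYTES, MPI_CHAR,
                              0, 0, MPI_COMM_WORLD, &req[i]);
                MPI_Waitall(depth, req, MPI_STATUSES_IGNORE);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("depth %2d : %8.3f usec per message\n",
                   depth, (t1 - t0) * 1e6 / (double)(ITERS * depth));
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

A flat set of numbers across depths would correspond to the flat
profiles I described above; a jump after a couple of messages in flight
would point at the request staying queued in the NIC.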