On Thu, 2005-01-12 at 16:16 -0800, David S. Miller wrote:
> From: "Ronciak, John" <[EMAIL PROTECTED]>
> Date: Thu, 1 Dec 2005 15:10:19 -0800
> 
> > Do you have a specific example?  We have tried this on Intel (x86 and
> > IA-64), AMD (x86 and x86_64) and PPC processors.  Most show gains with
> > none showing any detriment to performance with the test cases.  The
> > gains seen on the ones that show gains are significant in almost all
> > cases.  
> > 
> > If you have a specific case where you think that there are measurable
> > problems with it we would be happy to hear it so we can take a look at
> > it.
> 
> I think until a counter case is shown, the prefetches should
> remain on unconditionally.  Proof of detriment is the burdon
> of the accusor, especially since the Intel folks aparently
> did a lot of testing :-)
> 

So let me take the liberty of first cc-ing all the usual suspects i have
talked to about this in the past when i attempted to use prefetching on
e1000. Then i will try to make a case and eventually collect some newer
evidence ;-> I will be happy actually if it turns out i am wrong since i
would like to squueze as much juice out of them processors.

Background:
----------

After mucking with prefetching for some time a while back and observing
inconsistent behavior, I came across a patch from David Mosberger (BTW,
not sure if his email address above is correct - maybe someone could
forward him this discussion). With this specific patch (attached), in
all the Xeons i tested on (~2 years back) there was consistent
improvement of anywhere between 5-10Kpps in the forwarding tests if i
turned only in the AGGRESSIVE mode. Nothing useful when i used the other
mode.
On P3 class hardware prefetching without aggressive mode resulted in
decreased throughput but also decreased CPU occupancy. No effect with
aggressive mode. So the overall effect was it was not harmful to have
aggressive mode on. But i am not sure i even trust that would have been
the case in every piece of hardware i could find.

Lennert and I as well sweated over testing some simple kernel modules
with very inconsistent results if he used his laptop or i used mine.

I will test with a newer piece of hardware and one of the older ones i
have (both Xeons) - perhaps this weekend.
Robert may have some results perhaps on this driver, Robert?
It would also be nice for the intel folks to post their full results
somewhere.
 
Having said all the above:
--------------------------

The danger of prefetching is the close dependence on the "stride" (or
load-latency as David Mosberger would call it). 

If somehow the amount of CPU cycles executed to where the data is really
used happens to be less than the amount of time it would take to fetch
the line into cache, then it is useless and maybe detrimental. 
If you have small cache (It is hard to imagine that i just bought a
cheap AMD-64 _today_ with only 64KB L1 cache and 512KB L2) then
prefetching too early may be detrimental because it evicts other useful
data which you are using. If you have huge cache this may not happen
depending on the workload but it may result in no perfomance
improvement. If on the other hand you prefetch at a distance less than
the stride you are still gonna stall in any case.

The repurcasions are that a change in the driver that used to work may
result in degradation if it changes the "stride distance"; i.e it is
voodoo instead of a science since there is nothing in the compiler or at
runtime that will tell you "you are placing that prefetch too close".

On the copybreak change
-----------------------

Lets defer this since i already said a lot on prefetch ;->
But someone like Robert with results could shed more light.

I rest my case.

cheers,
jamal


===== drivers/net/e1000/e1000_main.c 1.134 vs edited =====
--- 1.134/drivers/net/e1000/e1000_main.c        2004-09-12 16:52:48 -07:00
+++ edited/drivers/net/e1000/e1000_main.c       2004-09-30 06:05:11 -07:00
@@ -2278,12 +2278,30 @@
        uint8_t last_byte;
        unsigned int i;
        boolean_t cleaned = FALSE;
+#define AGGRESSIVE 1
 
        i = rx_ring->next_to_clean;
+#if AGGRESSIVE
+       prefetch(rx_ring->buffer_info[i].skb->data);
+#endif
        rx_desc = E1000_RX_DESC(*rx_ring, i);
 
        while(rx_desc->status & E1000_RXD_STAT_DD) {
                buffer_info = &rx_ring->buffer_info[i];
+# if AGGRESSIVE
+               {
+                       struct e1000_rx_desc *next_rx;
+                       unsigned int j = i + 1;
+
+                       if (j == rx_ring->count)
+                               j = 0;
+                       next_rx = E1000_RX_DESC(*rx_ring, j);
+                       if (next_rx->status & E1000_RXD_STAT_DD)
+                               prefetch(rx_ring->buffer_info[j].skb->data);
+               }
+# else
+               prefetch(buffer_info->skb->data);
+# endif
 #ifdef CONFIG_E1000_NAPI
                if(*work_done >= work_to_do)
                        break;


Reply via email to