On Thu, 2005-01-12 at 16:16 -0800, David S. Miller wrote: > From: "Ronciak, John" <[EMAIL PROTECTED]> > Date: Thu, 1 Dec 2005 15:10:19 -0800 > > > Do you have a specific example? We have tried this on Intel (x86 and > > IA-64), AMD (x86 and x86_64) and PPC processors. Most show gains with > > none showing any detriment to performance with the test cases. The > > gains seen on the ones that show gains are significant in almost all > > cases. > > > > If you have a specific case where you think that there are measurable > > problems with it we would be happy to hear it so we can take a look at > > it. > > I think until a counter case is shown, the prefetches should > remain on unconditionally. Proof of detriment is the burdon > of the accusor, especially since the Intel folks aparently > did a lot of testing :-) >
So let me take the liberty of first cc-ing all the usual suspects i have talked to about this in the past when i attempted to use prefetching on e1000. Then i will try to make a case and eventually collect some newer evidence ;-> I will be happy actually if it turns out i am wrong since i would like to squueze as much juice out of them processors. Background: ---------- After mucking with prefetching for some time a while back and observing inconsistent behavior, I came across a patch from David Mosberger (BTW, not sure if his email address above is correct - maybe someone could forward him this discussion). With this specific patch (attached), in all the Xeons i tested on (~2 years back) there was consistent improvement of anywhere between 5-10Kpps in the forwarding tests if i turned only in the AGGRESSIVE mode. Nothing useful when i used the other mode. On P3 class hardware prefetching without aggressive mode resulted in decreased throughput but also decreased CPU occupancy. No effect with aggressive mode. So the overall effect was it was not harmful to have aggressive mode on. But i am not sure i even trust that would have been the case in every piece of hardware i could find. Lennert and I as well sweated over testing some simple kernel modules with very inconsistent results if he used his laptop or i used mine. I will test with a newer piece of hardware and one of the older ones i have (both Xeons) - perhaps this weekend. Robert may have some results perhaps on this driver, Robert? It would also be nice for the intel folks to post their full results somewhere. Having said all the above: -------------------------- The danger of prefetching is the close dependence on the "stride" (or load-latency as David Mosberger would call it). If somehow the amount of CPU cycles executed to where the data is really used happens to be less than the amount of time it would take to fetch the line into cache, then it is useless and maybe detrimental. If you have small cache (It is hard to imagine that i just bought a cheap AMD-64 _today_ with only 64KB L1 cache and 512KB L2) then prefetching too early may be detrimental because it evicts other useful data which you are using. If you have huge cache this may not happen depending on the workload but it may result in no perfomance improvement. If on the other hand you prefetch at a distance less than the stride you are still gonna stall in any case. The repurcasions are that a change in the driver that used to work may result in degradation if it changes the "stride distance"; i.e it is voodoo instead of a science since there is nothing in the compiler or at runtime that will tell you "you are placing that prefetch too close". On the copybreak change ----------------------- Lets defer this since i already said a lot on prefetch ;-> But someone like Robert with results could shed more light. I rest my case. cheers, jamal
===== drivers/net/e1000/e1000_main.c 1.134 vs edited ===== --- 1.134/drivers/net/e1000/e1000_main.c 2004-09-12 16:52:48 -07:00 +++ edited/drivers/net/e1000/e1000_main.c 2004-09-30 06:05:11 -07:00 @@ -2278,12 +2278,30 @@ uint8_t last_byte; unsigned int i; boolean_t cleaned = FALSE; +#define AGGRESSIVE 1 i = rx_ring->next_to_clean; +#if AGGRESSIVE + prefetch(rx_ring->buffer_info[i].skb->data); +#endif rx_desc = E1000_RX_DESC(*rx_ring, i); while(rx_desc->status & E1000_RXD_STAT_DD) { buffer_info = &rx_ring->buffer_info[i]; +# if AGGRESSIVE + { + struct e1000_rx_desc *next_rx; + unsigned int j = i + 1; + + if (j == rx_ring->count) + j = 0; + next_rx = E1000_RX_DESC(*rx_ring, j); + if (next_rx->status & E1000_RXD_STAT_DD) + prefetch(rx_ring->buffer_info[j].skb->data); + } +# else + prefetch(buffer_info->skb->data); +# endif #ifdef CONFIG_E1000_NAPI if(*work_done >= work_to_do) break;