Hi James, I like the idea of staying in poll longer.
My comments are similar to what Jamal and Stephen have already said. A tunable (via sysfs) would be nice, and a timer might be preferable to jiffy polling. Jiffy polling will not increase latency the way a timer would, but it will consume a lot more CPU than a timer, and hence more power. With jiffy polling you could have thousands of calls to poll() for a single received packet, while with a timer approach the number of polls per packet is bounded at 2. I think it may be difficult to make poll() efficient for the no-packet case because, at a minimum, you have to check the device state via the has_work method.

If you go to a timer implementation, then having a tunable will be important. Different applications will have different requirements on delay and jitter. Some applications may want to trade delay/jitter for less CPU/power consumption and some may not.

imho, the work should definitely be pursued further :)

Regards,
Mandeep

James Chapman ([EMAIL PROTECTED]) wrote:
> This RFC suggests some possible improvements to NAPI in the area of minimizing interrupt rates. A possible scheme to reduce the interrupt rate for the low packet rate / fast CPU case is described.
>
> First, do we need to encourage consistency in NAPI poll drivers? A survey of current NAPI drivers shows different strategies being used in their poll(). Some, such as r8169, do the napi_complete() if poll() does less work than their allowed budget. Others, such as e100 and tg3, do napi_complete() only if they do no work at all. And some drivers use NAPI only for receive handling, perhaps setting txdone interrupts for 1 in N transmitted packets, while others do all "interrupt" processing in their poll(). Should we encourage more consistency? Should we encourage more NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx processing in the poll(), and do napi_complete() only when the poll does _no_ work?
>
> One well known issue with NAPI is that it is possible with certain traffic patterns for NAPI drivers to schedule in and out of polled mode very quickly. Worst case, a NAPI driver might get 1 interrupt per packet. With fast CPUs and interfaces, this can happen at high rates, causing high CPU loads and poor packet processing performance. Some drivers avoid this by using hardware interrupt mitigation features of the network device in tandem with NAPI to throttle the max interrupt rate per device. But this adds latency. Jamal's paper http://kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf discusses this problem in some detail.
>
> By making some small changes to the NAPI core, I think it is possible to prevent high interrupt rates with NAPI, regardless of traffic patterns and without using per-device hardware interrupt mitigation. The basic idea is that instead of immediately exiting polled mode when it finds no work to do, the driver's poll() keeps itself in active polled mode for 1-2 jiffies and only does napi_complete() when it does no work in that time period. When it does no work in its poll(), the driver can return 0 while leaving itself in the NAPI poll list. This means it is possible for the softirq processing to spin around its active device list, doing no work, since no quota is consumed. A change is therefore also needed in the NAPI core to detect the case when the only devices being actively polled in softirq processing are doing no work on each poll, and to exit the softirq loop rather than wasting CPU cycles.
>
> The code changes are shown in the patch below. The patch is against the latest NAPI rework posted by DaveM http://marc.info/?l=linux-netdev&m=118829721407289&w=2. I used e100 and tg3 drivers to test.
> Since a driver that returns 0 from its poll() while leaving itself in polled mode would now be used by the NAPI core as a condition for exiting the softirq poll loop, all existing NAPI drivers would need to conform to this new invariant. Some drivers, e.g. e100, can return 0 even if they do tx work in their poll().
>
> Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise have gone idle means that it might be called many times by the NAPI softirq while it has no work to do. This wastes CPU cycles. It would therefore be important to implement the driver's poll() to make this case as efficient as possible, perhaps testing for it early.
>
> When a device is in polled mode while idle, there are 2 scheduling cases to consider:
>
> 1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx softirq will loop until the next jiffy tick or until quota is exceeded, calling each device in its polled list. Since the idle device is still in the poll list, it will be polled very rapidly.
>
> 2. No other active device is in the poll list. The net_rx softirq will poll the idle device twice and then exit the softirq processing loop as if quota were exceeded. See the net_rx_action() changes in the patch, which force the loop to exit if no work is being done by any device in the poll list.
>
> In both cases described above, the scheduler will continue NAPI processing from ksoftirqd. This might be very soon, especially if the system is otherwise idle. But if the system is idle, do we really care that idle network devices will be polled for 1-2 jiffies? If the system is otherwise busy, ksoftirqd will share the CPU with other threads/processes, which will reduce the poll rate anyway.
>
> In testing, I see a significant reduction in interrupt rate for typical traffic patterns. A flood ping, for example, keeps the device in polled mode, generating no interrupts.
> In a test, 8510 packets are sent/received versus 6200 previously; CPU load is 100% versus 62% previously; and 1 netdev interrupt occurs versus 12400 previously. Performance and CPU load under extreme network load (using pktgen) are unchanged, as expected. Most importantly, though, it is no longer possible to find a combination of CPU performance and traffic pattern that induces high interrupt rates. And because hardware interrupt mitigation isn't used, packet latency is minimized.
>
> The increase in CPU load isn't surprising for a flood ping test, since the CPU is working to bounce packets back as fast as it can. The increase in packet rate is a good indicator of how much interrupt and NAPI scheduling overhead there is. The CPU load shows 100% because ksoftirqd always wants the CPU for the duration of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive _better_ system response (smoother X cursor movement etc.) during the flood ping test, despite the CPU load being increased. For a system whose main job is processing network traffic quickly, like an embedded router or a network server, this approach might be very beneficial. For a desktop, I'm less sure, although as I said above, I've noticed no performance issues in my setups to date.
>
> Is this worth pursuing further? I'm considering doing more work to measure the effects at various relatively low packet rates. I also want to investigate using High Res Timers rather than jiffy sampling to reduce the idle poll time. Perhaps it is also worth trying HRT in the net_rx softirq too. I thought it would be worth throwing the ideas out there first to get early feedback.
>
> Here's the patch.
>
> Index: linux-2.6/drivers/net/e100.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/e100.c
> +++ linux-2.6/drivers/net/e100.c
> @@ -544,6 +544,7 @@ struct nic {
>  	struct cb *cb_to_use;
>  	struct cb *cb_to_send;
>  	struct cb *cb_to_clean;
> +	unsigned long exit_poll_time;
>  	u16 tx_command;
>  	/* End: frequently used values: keep adjacent for cache effect */
>
> @@ -1993,12 +1994,35 @@ static int e100_poll(struct napi_struct
>  	e100_rx_clean(nic, &work_done, budget);
>  	tx_cleaned = e100_tx_clean(nic);
>
> -	/* If no Rx and Tx cleanup work was done, exit polling mode. */
> -	if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
> +	if (!netif_running(netdev)) {
>  		netif_rx_complete(netdev, napi);
>  		e100_enable_irq(nic);
> +		return 0;
>  	}
>
> +	/* Stay in polled mode if we do any tx cleanup */
> +	if (tx_cleaned)
> +		work_done++;
> +
> +	/* If no Rx and Tx cleanup work was done, exit polling mode if
> +	 * we've seen no work for 1-2 jiffies.
> +	 */
> +	if (work_done == 0) {
> +		if (nic->exit_poll_time) {
> +			if (time_after(jiffies, nic->exit_poll_time)) {
> +				nic->exit_poll_time = 0;
> +				netif_rx_complete(netdev, napi);
> +				e100_enable_irq(nic);
> +			}
> +		} else {
> +			nic->exit_poll_time = jiffies + 2;
> +		}
> +		return 0;
> +	}
> +
> +	/* Otherwise, reset poll exit time and stay in poll list */
> +	nic->exit_poll_time = 0;
> +
>  	return work_done;
>  }
>
> Index: linux-2.6/drivers/net/tg3.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/tg3.c
> +++ linux-2.6/drivers/net/tg3.c
> @@ -3473,6 +3473,24 @@ static int tg3_poll(struct napi_struct *
>  	struct tg3_hw_status *sblk = tp->hw_status;
>  	int work_done = 0;
>
> +	/* fastpath having no work while we're holding ourself in
> +	 * polled mode
> +	 */
> +	if ((tp->exit_poll_time) && (!tg3_has_work(tp))) {
> +		if (time_after(jiffies, tp->exit_poll_time)) {
> +			tp->exit_poll_time = 0;
> +			/* tell net stack and NIC we're done */
> +			netif_rx_complete(netdev, napi);
> +			tg3_restart_ints(tp);
> +		}
> +		return 0;
> +	}
> +
> +	/* if we get here, there might be work to do so disable the
> +	 * poll hold fastpath above
> +	 */
> +	tp->exit_poll_time = 0;
> +
>  	/* handle link change and other phy events */
>  	if (!(tp->tg3_flags &
>  	      (TG3_FLAG_USE_LINKCHG_REG |
> @@ -3511,11 +3529,11 @@ static int tg3_poll(struct napi_struct *
>  	} else
>  		sblk->status &= ~SD_STATUS_UPDATED;
>
> -	/* if no more work, tell net stack and NIC we're done */
> -	if (!tg3_has_work(tp)) {
> -		netif_rx_complete(netdev, napi);
> -		tg3_restart_ints(tp);
> -	}
> +	/* if no more work, set the time in jiffies when we should
> +	 * exit polled mode
> +	 */
> +	if (!tg3_has_work(tp))
> +		tp->exit_poll_time = jiffies + 2;
>
>  	return work_done;
>  }
> Index: linux-2.6/drivers/net/tg3.h
> ===================================================================
> --- linux-2.6.orig/drivers/net/tg3.h
> +++ linux-2.6/drivers/net/tg3.h
> @@ -2163,6 +2163,7 @@ struct tg3 {
>  	u32 last_tag;
>
>  	u32 msg_enable;
> +	unsigned long exit_poll_time;
>
>  	/* begin "tx thread" cacheline section */
>  	void (*write32_tx_mbox) (struct tg3 *, u32,
> Index: linux-2.6/net/core/dev.c
> ===================================================================
> --- linux-2.6.orig/net/core/dev.c
> +++ linux-2.6/net/core/dev.c
> @@ -2073,6 +2073,8 @@ static void net_rx_action(struct softirq
>  	unsigned long start_time = jiffies;
>  	int budget = netdev_budget;
>  	void *have;
> +	struct napi_struct *last_hold = NULL;
> +	int done = 0;
>
>  	local_irq_disable();
>  	list_replace_init(&__get_cpu_var(softnet_data).poll_list, &list);
> @@ -2082,7 +2084,7 @@ static void net_rx_action(struct softirq
>  		struct napi_struct *n;
>
>  		/* if softirq window is exhuasted then punt */
> -		if (unlikely(budget <= 0 || jiffies != start_time)) {
> +		if (unlikely(budget <= 0 || jiffies != start_time || done)) {
>  			local_irq_disable();
>  			list_splice(&list,
>  				    &__get_cpu_var(softnet_data).poll_list);
>  			__raise_softirq_irqoff(NET_RX_SOFTIRQ);
> @@ -2096,12 +2098,28 @@ static void net_rx_action(struct softirq
>
>  		list_del(&n->poll_list);
>
> -		/* if quota not exhausted process work */
> +		/* if quota not exhausted process work. We special
> +		 * case on n->poll() returning 0 here when the driver
> +		 * doesn't do a napi_complete(), which indicates that
> +		 * the device wants to stay on the poll list although
> +		 * it did no work. We remember the first device to go
> +		 * into this state in order to terminate this loop if
> +		 * we see the same device again without doing any
> +		 * other work.
> +		 */
>  		if (likely(n->quota > 0)) {
>  			int work = n->poll(n, min(budget, n->quota));
>
> -			budget -= work;
> -			n->quota -= work;
> +			if (likely(work)) {
> +				budget -= work;
> +				n->quota -= work;
> +				last_hold = NULL;
> +			} else if (test_bit(NAPI_STATE_SCHED, &n->state)) {
> +				if (unlikely(n == last_hold))
> +					done = 1;
> +				if (likely(!last_hold))
> +					last_hold = n;
> +			}
>  		}
>
>  		/* if napi_complete not called, reschedule */

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html