On Thu, 6 Sep 2007 15:16:00 +0100 James Chapman <[EMAIL PROTECTED]> wrote:
> This RFC suggests some possible improvements to NAPI in the area of minimizing interrupt rates. A possible scheme to reduce interrupt rate for the low packet rate / fast CPU case is described.
>
> First, do we need to encourage consistency in NAPI poll drivers? A survey of current NAPI drivers shows different strategies being used in their poll(). Some, such as r8169, do the napi_complete() if poll() does less work than their allowed budget. Others, such as e100 and tg3, do napi_complete() only if they do no work at all. And some drivers use NAPI only for receive handling, perhaps setting txdone interrupts for 1 in N transmitted packets, while others do all "interrupt" processing in their poll(). Should we encourage more consistency? Should we encourage more NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx processing in the poll(), and do napi_complete() only when the poll does _no_ work?
>
> One well known issue with NAPI is that it is possible with certain traffic patterns for NAPI drivers to schedule in and out of polled mode very quickly. Worst case, a NAPI driver might get 1 interrupt per packet. With fast CPUs and interfaces, this can happen at high rates, causing high CPU loads and poor packet processing performance. Some drivers avoid this by using hardware interrupt mitigation features of the network device in tandem with NAPI to throttle the max interrupt rate per device. But this adds latency. Jamal's paper http://kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf discusses this problem in some detail.
>
> By making some small changes to the NAPI core, I think it is possible to prevent high interrupt rates with NAPI, regardless of traffic patterns and without using per-device hardware interrupt mitigation. The basic idea is that instead of immediately exiting polled mode when it finds no work to do, the driver's poll() keeps itself in active polled mode for 1-2 jiffies and only does napi_complete() when it does no work in that time period. When it does no work in its poll(), the driver can return 0 while leaving itself in the NAPI poll list. This means it is possible for the softirq processing to spin around its active device list, doing no work, since no quota is consumed. A change is therefore also needed in the NAPI core to detect the case when the only devices that are being actively polled in softirq processing are doing no work on each poll, and to exit the softirq loop rather than wasting CPU cycles.
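To make the idea concrete, a driver poll() following this scheme might look roughly like the sketch below. This is illustrative only, not the actual patch: my_adapter, my_clean_tx(), my_clean_rx(), my_enable_irq() and the last_work field are hypothetical driver details, and the napi_struct/napi_complete() calls follow the reworked NAPI API referenced further down.

/* Illustrative sketch only, not the actual patch.  A poll() that does all
 * rx and tx work and stays in polled mode until it has done no work for
 * 1-2 jiffies.  Assumes struct my_adapter embeds a struct napi_struct
 * member named "napi" and records the last jiffy on which work was found.
 */
static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_adapter *adapter = container_of(napi, struct my_adapter, napi);
	int tx_done, rx_done;

	/* Do all "interrupt" work here: tx completions as well as rx. */
	tx_done = my_clean_tx(adapter);
	rx_done = my_clean_rx(adapter, budget);

	if (tx_done || rx_done) {
		/* Remember when we last found something to do. */
		adapter->last_work = jiffies;
		return rx_done;		/* only rx work consumes budget */
	}

	/* No work this time.  Stay on the poll list (return 0 without
	 * napi_complete()) until 1-2 jiffies have passed with no work.
	 */
	if (time_before(jiffies, adapter->last_work + 2))
		return 0;

	/* Idle long enough: leave polled mode and re-enable interrupts. */
	napi_complete(napi);
	my_enable_irq(adapter);
	return 0;
}

The no-work path is deliberately just a couple of compares, since while the device is idle this function may be called many times per jiffy.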
> The code changes are shown in the patch below. The patch is against the latest NAPI rework posted by DaveM http://marc.info/?l=linux-netdev&m=118829721407289&w=2. I used e100 and tg3 drivers to test. Since a driver returning 0 from its poll() while leaving itself in polled mode would now be used by the NAPI core as a condition for exiting the softirq poll loop, all existing NAPI drivers would need to conform to this new invariant. Some drivers, e.g. e100, can return 0 even if they do tx work in their poll().
>
> Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise have gone idle means that it might be called many times by the NAPI softirq while it has no work to do. This wastes CPU cycles. It would therefore be important to implement the driver's poll() to make this case as efficient as possible, perhaps testing for it early.
>
> When a device is in polled mode while idle, there are two scheduling cases to consider:
>
> 1. One or more other netdevs are not idle and are consuming quota on each poll. The net_rx softirq will loop until the next jiffy tick or until quota is exceeded, calling each device in its polled list. Since the idle device is still in the poll list, it will be polled very rapidly.
>
> 2. No other active device is in the poll list. The net_rx softirq will poll the idle device twice and then exit the softirq processing loop as if quota is exceeded. See the net_rx_action() changes in the patch, which force the loop to exit if no work is being done by any device in the poll list.
>
> In both cases described above, the scheduler will continue NAPI processing from ksoftirqd. This might be very soon, especially if the system is otherwise idle. But if the system is idle, do we really care that idle network devices will be polled for 1-2 jiffies? If the system is otherwise busy, ksoftirqd will share the CPU with other threads/processes, which will reduce the poll rate anyway.
>
> In testing, I see a significant reduction in interrupt rate for typical traffic patterns. A flood ping, for example, keeps the device in polled mode, generating no interrupts. In a test, 8510 packets are sent/received versus 6200 previously; CPU load is 100% versus 62% previously; and 1 netdev interrupt occurs versus 12400 previously. Performance and CPU load under extreme network load (using pktgen) are unchanged, as expected. Most importantly though, it is no longer possible to find a combination of CPU performance and traffic pattern that induces high interrupt rates. And because hardware interrupt mitigation isn't used, packet latency is minimized.
>
> The increase in CPU load isn't surprising for a flood ping test since the CPU is working to bounce packets as fast as it can. The increase in packet rate is a good indicator of how much overhead the interrupt handling and NAPI scheduling add. The CPU load shows 100% because ksoftirqd always wants the CPU for the duration of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive _better_ system response (smoother X cursor movement etc) during the flood ping test, despite the CPU load being increased. For a system whose main job is processing network traffic quickly, like an embedded router or a network server, this approach might be very beneficial. For a desktop, I'm less sure, although as I said above, I've noticed no performance issues in my setups to date.
>
> Is this worth pursuing further? I'm considering doing more work to measure the effects at various relatively low packet rates. I also want to investigate using High Res Timers rather than jiffy sampling to reduce the idle poll time. Perhaps it is also worth trying HRT in the net_rx softirq too. I thought it would be worth throwing the ideas out there first to get early feedback.

What about the latency that NAPI imposes? Right now there are certain applications that don't like NAPI because it adds several more microseconds, and this may make it worse. Maybe a per-device flag or tuning parameter (like the weight sysfs value), or some other way to set low-latency values?
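One rough way to address that concern, in the spirit of the per-device flag / weight-style tunable suggested above, would be to make the idle-poll hold time adjustable, with 0 restoring the current complete-immediately behaviour. Again a sketch only: hold_jiffies is hypothetical and is shown here as a per-driver module parameter rather than a true per-device sysfs knob.

/* Hypothetical tunable: how long poll() stays in polled mode after the
 * last work was found.  0 means complete immediately (current behaviour,
 * lowest latency); 1-2 gives the reduced interrupt rate described above.
 */
static unsigned int hold_jiffies __read_mostly = 2;
module_param(hold_jiffies, uint, 0644);
MODULE_PARM_DESC(hold_jiffies, "Jiffies to stay in NAPI polled mode after last work (0 = complete immediately)");

/* The exit test in the poll() sketch earlier then becomes: */
if (time_before(jiffies, adapter->last_work + hold_jiffies))
	return 0;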