On Fri, May 11, 2018 at 10:44 AM, Gao Feng <gfree.w...@vip.163.com> wrote: > At 2018-05-11 21:23:55, "Willem de Bruijn" <willemdebruijn.ker...@gmail.com> > wrote: >>On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.w...@vip.163.com> wrote: >>> At 2018-05-11 11:54:55, "Willem de Bruijn" >>> <willemdebruijn.ker...@gmail.com> wrote: >>>>On Thu, May 10, 2018 at 4:28 AM, <gfree.w...@vip.163.com> wrote: >>>>> From: Gao Feng <gfree.w...@vip.163.com> >>>>> >>>>> The skb flow limit is implemented for each CPU independently. In the >>>>> current codes, the function skb_flow_limit gets the softnet_data by >>>>> this_cpu_ptr. But the target cpu of enqueue_to_backlog would be not >>>>> the current cpu when enable RPS. As the result, the skb_flow_limit checks >>>>> the stats of current CPU, while the skb is going to append the queue of >>>>> another CPU. It isn't the expected behavior. >>>>> >>>>> Now pass the softnet_data as a param to softnet_data to make consistent. >>>> >>>>The local cpu softnet_data is used on purpose. The operations in >>>>skb_flow_limit() on sd fields could race if not executed on the local cpu. >>> >>> I think the race doesn't exist because of the rps_lock. >>> The enqueue_to_backlog has hold the rps_lock before skb_flow_limit. >> >>Indeed, I overlooked that. There still is the matter of cache contention. > > The cache contention is really important in this case? > I don't think so, because the enqueue_to_backlog have touched and modified > the softnet_stat > of target cpu. > >> >>>>Flow limit tries to detect large ("elephant") DoS flows with a fixed >>>>four-tuple. >>>>These would always hit the same RPS cpu, so that cpu being backlogged >>> >>> They may hit the different target CPU when enable RFS. Because the app >>> could be scheduled >>> to another CPU, then RFS tries to deliver the skb to latest core which has >>> hot cache. >> >>This even more suggest using the initial (or IRQ) cpu to track state, instead >>of the destination (RPS/RFS) cpu. > > I couldn't understand why it is better to track state on initial cpu, not the > target cpu. > The latter one could get more accurate result.
For a single DoS flow with normal cpu pinned IRQs, the results will be equally good when tracked on the initial IRQ cpu.. > >> >>>>may be an indication that such a flow is active. But the flow will also >>>>always >>>>arrive on the same initial cpu courtesy of RSS. So storing the lookup table >>> >>> The RSS couldn't make sure the irq is handled by same cpu. It would be >>> balanced between >>> the cpus. >> >>IRQs are usually pinned to cores. Unless using something like irqbalance, >>but that operates at too coarse a timescale to do anything useful at Mpps >>packet rates. > > There are some motherboard which couldn't make sure the irq is pinned. > The flow_limit wouldn't work as well as expected. .. this seems to be the crux of the argument. I am not aware of any network interrupts that do not adhere to the cpu pinning configuration in /proc/irq/$IRQ/smp_affinity(_list) What kind of hardware ignores this setting and sprays interrupts? I agree that in that case flow_limit as is may be ineffective (if migration happens at rates comparable to packet rates). But this should not happen? > >> >>>>on the initial CPU is also fine. There may be false positives on other CPUs >>>>with the same RPS destination, but that is unlikely with a highly concurrent >>>>traffic server mix ("mice"). >>> >>> If my comment is right, the flow couldn't always arrive one the same >>> initial cpu, although >>> it may be sent to one same target cpu. >>> >>>> >>>>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature >>>>for the cpus on which traffic initially lands, not the RPS destination cpus. >>>>See also Documentation/networking/scaling.txt >>>> >>>>That said, I had to reread the code, as it does seem sensible that the >>>>same softnet_data is intended to be used both when testing qlen and >>>>flow_limit. >>> >>> In most cases, user configures the same RPS map with flow_limit like 0xff. >>> Because user couldn't predict which core the evil flow would arrive on. >>> >>> Take an example, there are 2 cores, cpu0 and cpu1. >>> One flow is the an evil flow, but the irq is sent to cpu0. After RPS/RFS, >>> the target cpu is cpu1. >>> Now cpu0 invokes enqueue_to_backlog, then the skb_flow_limit checkes the >>> queue length >>> of cpu0. Certainly it could pass the check of skb_flow_limit because there >>> is no any evil flow on cpu0. >> >>No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does >>check the queue length of the RPS cpu. > > Sorry, I overlooked the qlen is the length of the rps cpu. > Then it's ok unless the stats may be not accurate when irq isn't pinned. > > But I still doubt that is it really important to track state on initial cpu, > not target cpu? > Because the enqueue_to_backlog have touched the softnet_data of target cpu. I think the merit of both IRQ and RPS cpu can be argued for attaching the flow_limit state. Either way, the current behavior is not a bug, so I don't think that this is a candidate for net. The cost of moving from IRQ to RPS cpu will be the cacheline contention on a system with multiple IRQ cpus that all try to update the sd->flow_data of the same RPS cpus. Which is particularly likely with RFS. I suspect that this cost is non-trivial and not worth the benefit of handling hardware with unpinned IRQs.