On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.w...@vip.163.com> wrote: > At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.ker...@gmail.com> > wrote: >>On Thu, May 10, 2018 at 4:28 AM, <gfree.w...@vip.163.com> wrote: >>> From: Gao Feng <gfree.w...@vip.163.com> >>> >>> The skb flow limit is implemented for each CPU independently. In the >>> current codes, the function skb_flow_limit gets the softnet_data by >>> this_cpu_ptr. But the target cpu of enqueue_to_backlog would be not >>> the current cpu when enable RPS. As the result, the skb_flow_limit checks >>> the stats of current CPU, while the skb is going to append the queue of >>> another CPU. It isn't the expected behavior. >>> >>> Now pass the softnet_data as a param to softnet_data to make consistent. >> >>The local cpu softnet_data is used on purpose. The operations in >>skb_flow_limit() on sd fields could race if not executed on the local cpu. > > I think the race doesn't exist because of the rps_lock. > The enqueue_to_backlog has hold the rps_lock before skb_flow_limit.
Indeed, I overlooked that. There still is the matter of cache contention. >>Flow limit tries to detect large ("elephant") DoS flows with a fixed >>four-tuple. >>These would always hit the same RPS cpu, so that cpu being backlogged > > They may hit the different target CPU when enable RFS. Because the app could > be scheduled > to another CPU, then RFS tries to deliver the skb to latest core which has > hot cache. This even more suggest using the initial (or IRQ) cpu to track state, instead of the destination (RPS/RFS) cpu. >>may be an indication that such a flow is active. But the flow will also always >>arrive on the same initial cpu courtesy of RSS. So storing the lookup table > > The RSS couldn't make sure the irq is handled by same cpu. It would be > balanced between > the cpus. IRQs are usually pinned to cores. Unless using something like irqbalance, but that operates at too coarse a timescale to do anything useful at Mpps packet rates. >>on the initial CPU is also fine. There may be false positives on other CPUs >>with the same RPS destination, but that is unlikely with a highly concurrent >>traffic server mix ("mice"). > > If my comment is right, the flow couldn't always arrive one the same initial > cpu, although > it may be sent to one same target cpu. > >> >>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature >>for the cpus on which traffic initially lands, not the RPS destination cpus. >>See also Documentation/networking/scaling.txt >> >>That said, I had to reread the code, as it does seem sensible that the >>same softnet_data is intended to be used both when testing qlen and >>flow_limit. > > In most cases, user configures the same RPS map with flow_limit like 0xff. > Because user couldn't predict which core the evil flow would arrive on. > > Take an example, there are 2 cores, cpu0 and cpu1. > One flow is the an evil flow, but the irq is sent to cpu0. After RPS/RFS, the > target cpu is cpu1. > Now cpu0 invokes enqueue_to_backlog, then the skb_flow_limit checkes the > queue length > of cpu0. Certainly it could pass the check of skb_flow_limit because there is > no any evil flow on cpu0. No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does check the queue length of the RPS cpu. > Then the cpu0 inserts the skb into the queue of cpu1. > As a result, the skb_flow_limit doesn't work as expected. > > BTW, I have already sent the v2 patch which only adds the "Fixes: tag". The change also makes the code inconsistent with Documentation/networking/scaling.txt "In such environments, enable the feature on all CPUs that handle network rx interrupts (as set in /proc/irq/N/smp_affinity)."