Hi Eric,

I've been investigating a significant degradation in tc filter insertion rate, and it seems to be caused by your commit 001c96db0181 ("net: align gnet_stats_basic_cpu struct"). With this commit, the insertion rate drops from ~65k rules/sec to ~43k rules/sec when inserting 1M rules from a file in tc batch mode on my machine.

The tc perf profile indicates that the pcpu allocator now consumes about 2x the CPU:

1) Before:

Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
  Children      Self  Co  Shared Object      Symbol
+   21.19%     3.38%  tc  [kernel.vmlinux]   [k] pcpu_alloc
+    3.45%     0.25%  tc  [kernel.vmlinux]   [k] pcpu_alloc_area

2) After:

Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
  Children      Self  Co  Shared Object      Symbol
+   44.67%     3.99%  tc  [kernel.vmlinux]   [k] pcpu_alloc
+   19.25%     0.22%  tc  [kernel.vmlinux]   [k] pcpu_alloc_area

It seems that the pcpu allocator has to do much more work to satisfy an allocation under the new, stricter alignment requirement. I'm not sure whether this is expected behavior in this case.

Regards,
Vlad