On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vla...@mellanox.com> wrote:
>
> Hi Eric,
>
> I've been investigating significant tc filter insertion rate degradation
> and it seems it is caused by your commit 001c96db0181 ("net: align
> gnet_stats_basic_cpu struct"). With this commit insertion rate is
> reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> from file in tc batch mode on my machine.
>
> Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>
> 1) Before:
>
> Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>   Children      Self  Co  Shared Object      Symbol
> +   21.19%     3.38%  tc  [kernel.vmlinux]   [k] pcpu_alloc
> +    3.45%     0.25%  tc  [kernel.vmlinux]   [k] pcpu_alloc_area
>
> 2) After:
>
> Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>   Children      Self  Co  Shared Object      Symbol
> +   44.67%     3.99%  tc  [kernel.vmlinux]   [k] pcpu_alloc
> +   19.25%     0.22%  tc  [kernel.vmlinux]   [k] pcpu_alloc_area
>
> It seems that it takes much more work for pcpu allocator to perform
> allocation with new stricter alignment requirements. Not sure if it is
> expected behavior or not in this case.
>
> Regards,
> Vlad

Hi Vlad

I guess this is more a question for per-cpu allocator experts / maintainers?

16-byte alignment for 16-byte objects sounds quite reasonable [1]

It also means that if your workload is mostly setting up / dismantling
tc filters, rather than really using them, you might go back to atomics
instead of expensive per-cpu storage.

(I.e. optimize the control path instead of the data path.)

Thanks!

[1] We might even make this generic, as in:

diff --git a/mm/percpu.c b/mm/percpu.c
index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	 */
 	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
 		align = PCPU_MIN_ALLOC_SIZE;
-
+	while (align < L1_CACHE_BYTES && (align << 1) <= size) {
+		if (size % (align << 1))
+			break;
+		align <<= 1;
+	}
 	size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
 	bits = size >> PCPU_MIN_ALLOC_SHIFT;
 	bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
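
To make the effect of the loop in [1] easier to see, here is a rough userspace sketch of the same alignment-rounding logic, pulled out of pcpu_alloc() into a standalone function. The names sketch_effective_align(), SKETCH_MIN_ALLOC_SIZE and SKETCH_L1_CACHE_BYTES are hypothetical, and the values 4 and 64 are assumptions standing in for PCPU_MIN_ALLOC_SIZE and L1_CACHE_BYTES, which depend on the kernel version and architecture. It only illustrates the arithmetic and is not the allocator itself.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-ins for PCPU_MIN_ALLOC_SIZE and L1_CACHE_BYTES.
 * The real values come from the kernel headers and the architecture. */
#define SKETCH_MIN_ALLOC_SIZE 4u
#define SKETCH_L1_CACHE_BYTES 64u

/* Hypothetical helper: returns the alignment the proposed loop would
 * end up using for a request of 'size' bytes with alignment 'align'. */
static size_t sketch_effective_align(size_t size, size_t align)
{
	/* Alignment is expected to be a non-zero power of two. */
	assert(align != 0 && (align & (align - 1)) == 0);

	/* Same lower bound as the existing check in pcpu_alloc(). */
	if (align < SKETCH_MIN_ALLOC_SIZE)
		align = SKETCH_MIN_ALLOC_SIZE;

	/* Keep doubling the alignment while the doubled value still fits
	 * in the object and divides its size evenly, capped at one cache
	 * line. */
	while (align < SKETCH_L1_CACHE_BYTES && (align << 1) <= size) {
		if (size % (align << 1))
			break;
		align <<= 1;
	}
	return align;
}
```

Under these assumed constants, a 16-byte object requested with 8-byte alignment comes out 16-byte aligned, while a 24-byte object requested with 8-byte alignment stays at 8, because 24 is not a multiple of 16.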