On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vla...@mellanox.com> wrote:
>
> Hi Eric,
>
> I've been investigating a significant degradation in tc filter insertion
> rate, and it seems to be caused by your commit 001c96db0181 ("net: align
> gnet_stats_basic_cpu struct"). With this commit the insertion rate drops
> from ~65k rules/sec to ~43k rules/sec when inserting 1m rules from a file
> in tc batch mode on my machine.
>
> A tc perf profile indicates that the pcpu allocator now consumes 2x the CPU:
>
> 1) Before:
>
> Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>   Children      Self  Command  Shared Object     Symbol
> +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>
> 2) After:
>
> Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>   Children      Self  Command  Shared Object     Symbol
> +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>
> It seems that the pcpu allocator has to do much more work to perform
> allocations with the new, stricter alignment requirement. Not sure if
> that is expected behavior in this case.
>
> Regards,
> Vlad

Hi Vlad

I guess this is more a question for the per-cpu allocator experts / maintainers?

16-byte alignment for 16-byte objects sounds quite reasonable [1].
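
For reference, here is roughly what that commit does (a sketch paraphrased
from the upstream tree, not copied verbatim; the __aligned() annotation is
the part the commit adds):

struct gnet_stats_basic_cpu {
	struct gnet_stats_basic_packed bstats;	/* u64 bytes + u32 packets */
	struct u64_stats_sync syncp;
} __aligned(2 * sizeof(u64));	/* 16-byte alignment for a 16-byte object */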

It also means that if your workload is mostly about setting up /
dismantling tc filters, instead of really using them, you might go back
to atomics instead of expensive per-cpu storage.

(i.e. optimize the control path instead of the data path.)
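
To make that trade-off concrete, here is a minimal sketch (hypothetical
struct and field names, not the actual tc code) of the two options:

#include <linux/atomic.h>
#include <net/gen_stats.h>

struct my_filter_stats {
	/* Data-path friendly: each CPU updates its own 16-byte slot,
	 * but every filter insertion pays one pcpu_alloc(). */
	struct gnet_stats_basic_cpu __percpu *cpu_bstats;

	/* Control-path friendly: nothing to allocate per filter, but
	 * concurrent updates bounce these cache lines between CPUs. */
	atomic64_t bytes;
	atomic64_t packets;
};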

Thanks!

[1] We might even make this generic, as in:

diff --git a/mm/percpu.c b/mm/percpu.c
index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
         */
        if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
                align = PCPU_MIN_ALLOC_SIZE;
-
+       while (align < L1_CACHE_BYTES && (align << 1) <= size) {
+               if (size % (align << 1))
+                       break;
+               align <<= 1;
+       }
        size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
        bits = size >> PCPU_MIN_ALLOC_SHIFT;
        bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
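
To see what the loop would pick, here is a quick userspace sketch of the
same logic (assuming PCPU_MIN_ALLOC_SIZE == 4 and L1_CACHE_BYTES == 64,
the x86-64 values):

#include <stdio.h>

#define PCPU_MIN_ALLOC_SIZE	4
#define L1_CACHE_BYTES		64

static size_t bump_align(size_t size, size_t align)
{
	if (align < PCPU_MIN_ALLOC_SIZE)
		align = PCPU_MIN_ALLOC_SIZE;
	/* Double the alignment while it still divides the size evenly
	 * and stays below a cache line. */
	while (align < L1_CACHE_BYTES && (align << 1) <= size) {
		if (size % (align << 1))
			break;
		align <<= 1;
	}
	return align;
}

int main(void)
{
	printf("size 16 -> align %zu\n", bump_align(16, 4));	/* 4 -> 8 -> 16 */
	printf("size 24 -> align %zu\n", bump_align(24, 4));	/* stops at 8, 24 % 16 != 0 */
	printf("size 12 -> align %zu\n", bump_align(12, 4));	/* stays at 4, 12 % 8 != 0 */
	return 0;
}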
