Hello, On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote: > > 1) Before: > > > > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071 > > Children Self Co Shared Object Symbol > > + 21.19% 3.38% tc [kernel.vmlinux] [k] pcpu_alloc > > + 3.45% 0.25% tc [kernel.vmlinux] [k] pcpu_alloc_area > > > > 2) After: > > > > Samples1: 92K of event 'cycles:ppp', Event count (approx.): 71446806550 > > Children Self Co Shared Object Symbol > > + 44.67% 3.99% tc [kernel.vmlinux] [k] pcpu_alloc > > + 19.25% 0.22% tc [kernel.vmlinux] [k] pcpu_alloc_area
The allocator hint only remembers the max available size per chunk but not the alignment, so depending on the allocation pattern, alignment requirement change can lead to way more scanning per alloc attempt. Shouldn't be too difficult to improve tho. > I guess this is more a question for per-cpu allocator experts / maintainers ? > > 16-bytes alignment for 16-bytes objects sound quite reasonable [1] > > It also means that if your workload is mostly being able to setup / > dismantle tc filters, > instead of really using them, you might go back to atomics instead of > expensive per cpu storage. > > (Ie optimize control path instead of data path) > > Thanks ! > > [1] We even might make this generic as in : > > diff --git a/mm/percpu.c b/mm/percpu.c > index > 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 > 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, > size_t align, bool reserved, > */ > if (unlikely(align < PCPU_MIN_ALLOC_SIZE)) > align = PCPU_MIN_ALLOC_SIZE; > - > + while (align < L1_CACHE_BYTES && (align << 1) <= size) { > + if (size % (align << 1)) > + break; > + align <<= 1; > + } Percpu storage is expensive and cache line sharing tends to be less of a problem (cuz they're per-cpu), so it is useful to support custom alignments for tighter packing. Thanks. -- tejun