On Thu 24 Jan 2019 at 17:21, Dennis Zhou <den...@kernel.org> wrote:
> Hi Vlad and Eric,
>
> On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
>> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vla...@mellanox.com> wrote:
>> >
>> > Hi Eric,
>> >
>> > I've been investigating significant tc filter insertion rate degradation
>> > and it seems it is caused by your commit 001c96db0181 ("net: align
>> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
>> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
>> > from file in tc batch mode on my machine.
>> >
>> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>> >
>> > 1) Before:
>> >
>> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>> >   Children      Self  Co  Shared Object     Symbol
>> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
>> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>> >
>> > 2) After:
>> >
>> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>> >   Children      Self  Co  Shared Object     Symbol
>> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
>> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>> >
>> > It seems that it takes much more work for pcpu allocator to perform
>> > allocation with new stricter alignment requirements. Not sure if it is
>> > expected behavior or not in this case.
>> >
>> > Regards,
>> > Vlad
>
> Would you mind sharing a little more information with me:
> 1) output before and after a run of /sys/kernel/debug/percpu_stats

Hi Dennis,

Some of these files are quite large, so I put them on my Dropbox.

Output before:

Percpu Memory Statistics
Allocation Info:
----------------------------------------
  unit_size           :       262144
  static_size         :       139160
  reserved_size       :         8192
  dyn_size            :        28776
  atom_size           :      2097152
  alloc_size          :      2097152

Global Stats:
----------------------------------------
  nr_alloc            :         3343
  nr_dealloc          :          752
  nr_cur_alloc        :         2591
  nr_max_alloc        :         2598
  nr_chunks           :            3
  nr_max_chunks       :            3
  min_alloc_size      :            4
  max_alloc_size      :         8208
  empty_pop_pages     :            3

Per Chunk Stats:
----------------------------------------
Chunk: <- Reserved Chunk
  nr_alloc            :            5
  max_alloc_size      :          320
  empty_pop_pages     :            0
  first_bit           :         1002
  free_bytes          :         7448
  contig_bytes        :         7424
  sum_frag            :           24
  max_frag            :           24
  cur_min_alloc       :           16
  cur_med_alloc       :           64
  cur_max_alloc       :          320

Chunk: <- First Chunk
  nr_alloc            :          479
  max_alloc_size      :         8208
  empty_pop_pages     :            0
  first_bit           :         8192
  free_bytes          :            0
  contig_bytes        :            0
  sum_frag            :            0
  max_frag            :            0
  cur_min_alloc       :            4
  cur_med_alloc       :           24
  cur_max_alloc       :         8208

Chunk:
  nr_alloc            :         1925
  max_alloc_size      :         8208
  empty_pop_pages     :            0
  first_bit           :        63102
  free_bytes          :          852
  contig_bytes        :           12
  sum_frag            :          852
  max_frag            :           12
  cur_min_alloc       :            4
  cur_med_alloc       :            8
  cur_max_alloc       :         8208

Chunk:
  nr_alloc            :          182
  max_alloc_size      :          936
  empty_pop_pages     :            3
  first_bit           :           21
  free_bytes          :       256452
  contig_bytes        :       255120
  sum_frag            :         1332
  max_frag            :          368
  cur_min_alloc       :            8
  cur_med_alloc       :           20
  cur_max_alloc       :          320

After:
https://www.dropbox.com/s/unyzhx4vgo2x30e/stats_after?dl=0

> 2) a full perf output

https://www.dropbox.com/s/isfcxca3npn5slx/perf.data?dl=0

> 3) a reproducer

$ sudo tc -b add.0

Example batch file:
https://www.dropbox.com/s/ey7cbl5nwu5p0tg/add.0?dl=0

Thanks,
Vlad
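
In case it helps with comparing the before and after percpu_stats dumps field by field, here is a minimal Python sketch of a parser and differ. It assumes the dump layout is exactly what is shown above ("name : integer" lines grouped under headers ending in a colon). The names parse_percpu_stats and diff_stats and the chunk0/chunk1 labels are made up for illustration and are not part of any kernel tooling. SAMPLE is an abridged excerpt of the "before" output; to compare real runs, the two saved dumps would be read from files instead.

```python
import re

# Abridged excerpt of the "before" dump quoted in this mail.
SAMPLE = """\
Percpu Memory Statistics
Global Stats:
----------------------------------------
  nr_alloc            :         3343
  nr_dealloc          :          752
  nr_cur_alloc        :         2591
Per Chunk Stats:
----------------------------------------
Chunk: <- Reserved Chunk
  nr_alloc            :            5
  free_bytes          :         7448
Chunk: <- First Chunk
  nr_alloc            :          479
  free_bytes          :            0
"""

FIELD_RE = re.compile(r"^(\w+)\s*:\s*(-?\d+)$")


def parse_percpu_stats(text):
    """Parse a percpu_stats dump into {section: {field: int}}.

    Chunks are keyed by their position in the dump (hypothetical
    labels "chunk0", "chunk1", ...), since most chunks are unnamed.
    """
    sections = {}
    current = None
    chunk_idx = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue  # skip blank lines and dashed separators
        m = FIELD_RE.match(line)
        if m and current is not None:
            sections[current][m.group(1)] = int(m.group(2))
        elif line.startswith("Chunk:"):
            label = line[len("Chunk:"):].strip(" <-")
            current = f"chunk{chunk_idx}" + (f" ({label})" if label else "")
            chunk_idx += 1
            sections[current] = {}
        elif line.endswith(":"):
            current = line[:-1]
            sections[current] = {}
    return sections


def diff_stats(before, after):
    """Return {(section, field): (old, new)} for fields whose value changed."""
    changes = {}
    for section in before.keys() & after.keys():
        for field in before[section].keys() & after[section].keys():
            old, new = before[section][field], after[section][field]
            if old != new:
                changes[(section, field)] = (old, new)
    return changes


if __name__ == "__main__":
    stats = parse_percpu_stats(SAMPLE)
    for section, fields in stats.items():
        print(section, fields)
```

Matching chunks by position is a simplification: chunk order may differ between two runs, so per-chunk differences should be read with care, while the "Global Stats" section compares directly.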