On Mon, 4 Apr 2016 13:00:34 -0700
Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote:

> As seen in 'perf report' from patch 5:
>    3.32%  ksoftirqd/1   [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> this is 14Mpps and 4 assembler instructions in the above function
> are consuming 3% of the cpu.

At this level we also need to take into account the cost/overhead of
a function call, which I've measured to be between 5 and 7 cycles as
part of my time_bench_sample[1] test.

> Making new_load_byte to be single x86 insn would be really cool.
>
> Of course, there are other pieces to accelerate:
>   12.71%  ksoftirqd/1   [mlx4_en]         [k] mlx4_en_alloc_frags
>    6.87%  ksoftirqd/1   [mlx4_en]         [k] mlx4_en_free_frag
>    4.20%  ksoftirqd/1   [kernel.vmlinux]  [k] get_page_from_freelist
>    4.09%  swapper       [mlx4_en]         [k] mlx4_en_process_rx_cq
> and I think Jesper's work on batch allocation is going to help that
> a lot.

Actually, it looks like all of this "overhead" comes from the page
alloc/free (plus the DMA unmap/map). We would need a page-pool recycle
mechanism to solve/remove this overhead. For the early drop case we
might be able to hack page recycling directly into the driver (and
also avoid the dma_unmap/map cycle).

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
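For illustration, here is a minimal user-space sketch of the kind of
rdtsc loop used to estimate per-call overhead. This is not the actual
time_bench_sample code; measured_func and LOOPS are made-up names, and
the measured number includes the loop overhead itself (averaging over
many iterations amortizes the rdtsc cost):

/* Sketch: estimate function call cost via TSC (x86 only).
 * Compile: gcc -O2 -o call_cost call_cost.c
 */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

#define LOOPS 100000000UL

/* noinline keeps the actual call+ret sequence in place under -O2 */
__attribute__((noinline)) static uint64_t measured_func(uint64_t x)
{
	return x + 1;
}

int main(void)
{
	uint64_t sum = 0, start, stop;
	unsigned long i;

	start = __rdtsc();
	for (i = 0; i < LOOPS; i++)
		sum = measured_func(sum);	/* data dependency stops the
						 * compiler eliding the loop */
	stop = __rdtsc();

	/* cycles per iteration = call + ret + loop overhead */
	printf("cycles/call ~= %.2f (sum=%llu)\n",
	       (double)(stop - start) / LOOPS,
	       (unsigned long long)sum);
	return 0;
}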
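And the page-pool recycle idea, as a rough sketch: keep RX pages (and
their streaming DMA mapping) in a small per-ring pool, so the refill
fast path skips both alloc_page() and dma_map_page(). All mydrv_* names
are hypothetical, and a real implementation would have to respect
locking and NAPI context rules:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define MYDRV_POOL_SIZE 512

struct mydrv_rx_page {
	struct page	*page;
	dma_addr_t	dma;	/* mapping survives across recycles */
};

struct mydrv_rx_ring {
	struct device		*dev;
	struct mydrv_rx_page	pool[MYDRV_POOL_SIZE];
	unsigned int		pool_cnt;
};

/* RX refill fast path: recycle if possible, else allocate + map */
static int mydrv_get_rx_page(struct mydrv_rx_ring *ring,
			     struct mydrv_rx_page *slot)
{
	if (ring->pool_cnt) {
		*slot = ring->pool[--ring->pool_cnt];
		return 0;		/* no alloc_page, no dma_map_page */
	}

	slot->page = alloc_page(GFP_ATOMIC);
	if (!slot->page)
		return -ENOMEM;

	slot->dma = dma_map_page(ring->dev, slot->page, 0, PAGE_SIZE,
				 DMA_FROM_DEVICE);
	if (dma_mapping_error(ring->dev, slot->dma)) {
		__free_page(slot->page);
		return -ENOMEM;
	}
	return 0;
}

/* Early-drop path: park the still-mapped page for reuse */
static void mydrv_recycle_rx_page(struct mydrv_rx_ring *ring,
				  struct mydrv_rx_page *slot)
{
	if (ring->pool_cnt < MYDRV_POOL_SIZE) {
		/* hand ownership back to the device before reuse */
		dma_sync_single_for_device(ring->dev, slot->dma,
					   PAGE_SIZE, DMA_FROM_DEVICE);
		ring->pool[ring->pool_cnt++] = *slot;
		return;
	}

	/* pool full: fall back to the normal unmap + free */
	dma_unmap_page(ring->dev, slot->dma, PAGE_SIZE, DMA_FROM_DEVICE);
	__free_page(slot->page);
}

In the early drop case mydrv_recycle_rx_page() would be called straight
from the RX poll loop, so a dropped page is handed back to the NIC
without ever touching the page allocator or the IOMMU.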
> As seen in 'perf report' from patch 5: > 3.32% ksoftirqd/1 [kernel.vmlinux] [k] sk_load_byte_positive_offset > this is 14Mpps and 4 assembler instructions in the above function > are consuming 3% of the cpu. At this level we also need to take into account the cost/overhead of a function call. Which I've measured to between 5-7 cycles, part of my time_bench_sample[1] test. > Making new_load_byte to be single x86 insn would be really cool. > > Of course, there are other pieces to accelerate: > 12.71% ksoftirqd/1 [mlx4_en] [k] mlx4_en_alloc_frags > 6.87% ksoftirqd/1 [mlx4_en] [k] mlx4_en_free_frag > 4.20% ksoftirqd/1 [kernel.vmlinux] [k] get_page_from_freelist > 4.09% swapper [mlx4_en] [k] mlx4_en_process_rx_cq > and I think Jesper's work on batch allocation is going help that a lot. Actually, it looks like all of this "overhead" comes from the page alloc/free (+ dma unmap/map). We would need a page-pool recycle mechanism to solve/remove this overhead. For the early drop case we might be able to hack recycle the page directly in the driver (and also avoid dma_unmap/map cycle). [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer