On Mon, 4 Apr 2016 13:00:34 -0700 Alexei Starovoitov 
<alexei.starovoi...@gmail.com> wrote:

> As seen in 'perf report' from patch 5:
>   3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> this is 14Mpps and 4 assembler instructions in the above function
> are consuming 3% of the cpu.

At this level we also need to take into account the cost/overhead of a
function call, which I've measured to be between 5-7 cycles as part of
my time_bench_sample[1] test.
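
To give an idea of how such a per-call number can be obtained, here is
a minimal user-space sketch (NOT the actual time_bench_sample kernel
module code, just an illustration of the same measurement idea):
amortize a trivial noinline function call over many iterations and
divide the TSC delta by the loop count.

  /* Sketch: estimate per-call cost in TSC cycles.
   * Names (noop_func, LOOPS) are made up for illustration. */
  #include <stdio.h>
  #include <stdint.h>
  #include <x86intrin.h>   /* __rdtsc() */

  #define LOOPS 10000000UL

  static __attribute__((noinline)) int noop_func(int x)
  {
          asm volatile("" : "+r" (x)); /* keep call from being optimized out */
          return x;
  }

  int main(void)
  {
          uint64_t start, stop;
          unsigned long i;
          volatile int sink = 0;

          start = __rdtsc();
          for (i = 0; i < LOOPS; i++)
                  sink += noop_func(i);
          stop = __rdtsc();

          printf("~%.1f cycles per call (loop overhead included)\n",
                 (double)(stop - start) / LOOPS);
          return 0;
  }

The real benchmark also subtracts the bare loop overhead; this sketch
skips that, so treat the number as an upper bound.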

> Making new_load_byte to be single  x86 insn would be really cool.
> 
> Of course, there are other pieces to accelerate:
>  12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
>   6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
>   4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
>   4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
> and I think Jesper's work on batch allocation is going help that a lot.

Actually, it looks like all of this "overhead" comes from the page
alloc/free (+ dma unmap/map). We would need a page-pool recycle
mechanism to solve/remove this overhead.  For the early drop case we
might be able to hack up recycling the page directly in the driver
(and also avoid the dma_unmap/map cycle).
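
To sketch what I mean by driver-local recycling (this is NOT mlx4_en
code and not a finished page-pool design; struct rx_page_cache,
rx_page_get and rx_page_recycle are made-up names): keep a small
per-RX-ring cache of pages, refill from it first, and on early drop
put the page back into the cache instead of freeing it, so both the
allocator round-trip and the dma_unmap/map can be skipped.

  /* Illustrative per-RX-ring page recycle cache (assumed design). */
  struct rx_page_cache {
          struct page *pages[256];
          unsigned int count;
  };

  /* RX refill: prefer a recycled (still DMA-mapped) page, else fall
   * back to the normal page allocator. */
  static struct page *rx_page_get(struct rx_page_cache *cache, gfp_t gfp)
  {
          if (cache->count)
                  return cache->pages[--cache->count];
          return alloc_page(gfp);
  }

  /* Early drop path: instead of put_page() (and dma_unmap), keep the
   * page for the next refill, provided nobody else holds a reference
   * and the cache has room. */
  static void rx_page_recycle(struct rx_page_cache *cache, struct page *page)
  {
          if (cache->count < ARRAY_SIZE(cache->pages) &&
              page_count(page) == 1) {
                  cache->pages[cache->count++] = page;
                  return;
          }
          put_page(page); /* cache full or page still in use elsewhere */
  }

The refcount check is the important part: a page can only be reused
this way if the stack (or a BPF program) did not keep a reference.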


[1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer