On Mon, 28 Nov 2016 13:21:41 +0100
Jesper Dangaard Brouer <bro...@redhat.com> wrote:

> On Mon, 28 Nov 2016 11:52:38 +0100 Paolo Abeni <pab...@redhat.com> wrote:
> >
> > > > [2] like [1], but using the minimum number of flows to saturate the
> > > > user space sink, that is 1 flow for the old kernel and 3 for the
> > > > patched one. The tput increases since the contention on the rx lock
> > > > is low.
> > > > [3] like [1], but using a single flow with both old and new kernel.
> > > > All the packets land on the same rx queue and there is a single
> > > > ksoftirqd instance running [...]
> >
> > We also used a connected socket for test [3], with relatively little
> > difference (the tput increased for both the unpatched and patched
> > kernel, and the difference was roughly the same).
>
> When I use connected sockets (RX side) and ip_early_demux enabled, I do
> see a performance boost for recvmmsg. With these patches applied,
> ksoftirqd forced onto CPU0 and udp_sink on CPU2, pktgen sending a
> single flow with packet size 1472 bytes:
>
> $ sysctl net/ipv4/ip_early_demux
> net.ipv4.ip_early_demux = 1
>
> $ grep -H . /proc/sys/net/core/{r,w}mem_max
> /proc/sys/net/core/rmem_max:1048576
> /proc/sys/net/core/wmem_max:1048576
>
> # taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1
> #                                     ns         pps     cycles
> recvMmsg/32  run: 0 10000000  462.51  2162095.23  1853
> recvmsg      run: 0 10000000  536.47  1864041.75  2150
> read         run: 0 10000000  492.01  2032460.71  1972
> recvfrom     run: 0 10000000  553.94  1805262.84  2220
>
> # taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1 --connect
> #                                     ns         pps     cycles
> recvMmsg/32  run: 0 10000000  405.15  2468225.03  1623
> recvmsg      run: 0 10000000  548.23  1824049.58  2197
> read         run: 0 10000000  489.76  2041825.27  1962
> recvfrom     run: 0 10000000  466.18  2145091.77  1868
>
> My theory is that by enabling a connected RX socket, the ksoftirqd gets
> faster (no fib_lookup) and is no longer a bottleneck. This is
> confirmed by nstat.
Paolo asked me to do a test with small packets with pktgen, and I was
actually surprised by the result.

 # taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1 --connect
 recvMmsg/32  run: 0 10000000  426.61  2344076.59  1709  17098657328
 recvmsg      run: 0 10000000  533.49  1874449.82  2138  21382574965
 read         run: 0 10000000  470.22  2126651.13  1884  18846797802
 recvfrom     run: 0 10000000  513.74  1946499.83  2059  20591095477

Notice how recvMmsg/32 got slower, by 124 kpps (2468225 pps ->
2344076 pps). I was expecting it to get faster, given we had just
established that udp_sink was the bottleneck, and smaller packets
should mean less copying of bytes to userspace
(copy_user_enhanced_fast_string). (With nstat I observe that ksoftirqd
is again the bottleneck.)

Looking at the perf diff of CPU2 (baseline = 64 bytes) we do see an
increase in copy_user_enhanced_fast_string. More interestingly, we see
a decrease in the locking cost when using big packets (see ** below):

# Event 'cycles:ppp'
#
# Baseline  Delta    Shared Object     Symbol
# ........  .......  ................  .........................................
#
    15.09%   +0.33%  [kernel.vmlinux]  [k] copy_msghdr_from_user
    12.36%  +21.89%  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     8.65%   -0.63%  [kernel.vmlinux]  [k] udp_process_skb
     7.33%   -1.88%  [kernel.vmlinux]  [k] __skb_try_recv_datagram_batch
 **  7.12%   -6.66%  [kernel.vmlinux]  [k] udp_rmem_release **
 **  6.71%   -6.52%  [kernel.vmlinux]  [k] _raw_spin_lock_bh **
     6.35%   +1.36%  [kernel.vmlinux]  [k] __free_page_frag
     4.39%   +0.29%  [kernel.vmlinux]  [k] copy_msghdr_to_user_gen
     2.87%   -1.52%  [kernel.vmlinux]  [k] skb_release_data
     2.60%   +0.14%  [kernel.vmlinux]  [k] __put_user_4
     2.27%   -2.18%  [kernel.vmlinux]  [k] __sk_mem_reduce_allocated
     2.11%   +0.08%  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.68
     1.90%   +2.40%  [kernel.vmlinux]  [k] __slab_free
     1.73%   +0.20%  [kernel.vmlinux]  [k] __udp_recvmmsg
     1.62%   -1.62%  [kernel.vmlinux]  [k] intel_idle
     1.52%   +0.22%  [kernel.vmlinux]  [k] copy_to_iter
     1.20%   -0.03%  [kernel.vmlinux]  [k] import_iovec
     1.14%   +0.05%  [kernel.vmlinux]  [k] rw_copy_check_uvector
     0.80%   -0.04%  [kernel.vmlinux]  [k] recvmmsg_ctx_to_user
     0.75%   -0.69%  [kernel.vmlinux]  [k] __local_bh_enable_ip
     0.71%   +0.18%  [kernel.vmlinux]  [k] skb_copy_datagram_iter
     0.70%   -0.07%  [kernel.vmlinux]  [k] recvmmsg_ctx_from_user
     0.67%   +0.08%  [kernel.vmlinux]  [k] kmem_cache_free
     0.56%   +0.42%  [kernel.vmlinux]  [k] udp_process_msg
     0.48%   +0.05%  [kernel.vmlinux]  [k] skb_release_head_state
     0.46%           [kernel.vmlinux]  [k] lapic_next_deadline
     0.36%           [kernel.vmlinux]  [k] __switch_to
     0.34%   -0.03%  [kernel.vmlinux]  [k] consume_skb
     0.32%   -0.05%  [kernel.vmlinux]  [k] skb_consume_udp

The perf diff from CPU0 also shows less lock congestion:

# Event 'cycles:ppp'
#
# Baseline  Delta    Shared Object     Symbol
# ........  .......  ................  .........................................
#
    11.04%   -3.02%  [kernel.vmlinux]  [k] __udp_enqueue_schedule_skb
     9.98%   +2.16%  [mlx5_core]       [k] mlx5e_handle_rx_cqe
     7.23%   -1.85%  [kernel.vmlinux]  [k] udp_v4_early_demux
     3.90%   +0.73%  [kernel.vmlinux]  [k] build_skb
     3.85%   -1.77%  [kernel.vmlinux]  [k] udp_queue_rcv_skb
     3.83%   +0.02%  [kernel.vmlinux]  [k] sock_def_readable
 **  3.26%   -3.19%  [kernel.vmlinux]  [k] queued_spin_lock_slowpath **
     2.99%   +0.55%  [kernel.vmlinux]  [k] __build_skb
     2.97%   +0.11%  [kernel.vmlinux]  [k] __udp4_lib_rcv
 **  2.87%   -1.39%  [kernel.vmlinux]  [k] _raw_spin_lock **
     2.67%   +0.60%  [kernel.vmlinux]  [k] ip_rcv
     2.65%   +0.61%  [kernel.vmlinux]  [k] __netif_receive_skb_core
     2.64%   +0.79%  [ip_tables]       [k] ipt_do_table
     2.37%   +0.37%  [kernel.vmlinux]  [k] read_tsc
     2.26%   +0.52%  [mlx5_core]       [k] mlx5e_get_cqe
     2.11%   -1.15%  [kernel.vmlinux]  [k] __sk_mem_raise_allocated
     2.10%   +0.37%  [kernel.vmlinux]  [k] __rcu_read_unlock
     2.04%   +0.67%  [mlx5_core]       [k] mlx5e_alloc_rx_wqe
     1.86%   +0.40%  [kernel.vmlinux]  [k] inet_gro_receive
     1.57%   +0.11%  [kernel.vmlinux]  [k] kmem_cache_alloc
     1.53%   +0.28%  [kernel.vmlinux]  [k] _raw_read_lock
     1.53%   +0.25%  [kernel.vmlinux]  [k] dev_gro_receive
     1.38%   -0.18%  [kernel.vmlinux]  [k] udp_gro_receive
     1.19%   +0.37%  [kernel.vmlinux]  [k] __rcu_read_lock
     1.14%   +0.31%  [kernel.vmlinux]  [k] _raw_read_unlock
     1.14%   +0.12%  [kernel.vmlinux]  [k] ip_rcv_finish
     1.13%   +0.20%  [kernel.vmlinux]  [k] __udp4_lib_lookup
     1.05%   +0.16%  [kernel.vmlinux]  [k] ktime_get_with_offset
     0.94%   +0.38%  [kernel.vmlinux]  [k] ip_local_deliver_finish
     0.91%   +0.22%  [kernel.vmlinux]  [k] do_csum
     0.86%   -0.04%  [kernel.vmlinux]  [k] ipv4_pktinfo_prepare
     0.84%   +0.05%  [kernel.vmlinux]  [k] sk_filter_trim_cap
     0.84%   +0.20%  [kernel.vmlinux]  [k] ip_local_deliver
     0.84%   +0.19%  [kernel.vmlinux]  [k] udp4_gro_receive

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer