On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 27 Sep 2017 06:35:40 -0700
> John Fastabend <john.fastab...@gmail.com> wrote:
> 
> > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:
> > > On Tue, 26 Sep 2017 21:58:53 +0200
> > > Daniel Borkmann <dan...@iogearbox.net> wrote:
> > > 
> > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> > >> [...]
> > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> > >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> > >>> actually works extremely well).
> > >>
> > >> Meaning you let all the XDP_PASS packets get processed on a
> > >> different CPU, so you can reserve the whole CPU just for
> > >> prefiltering, right?
> > > 
> > > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > > The trick is using the map-flush point, to transfer packets in bulk to
> > > the remote CPU (single call IPC is too slow), but at the same time
> > > flush single packets if NAPI didn't see a bulk.
> > > 
> > >> Do you have some numbers to share at this point, just curious when
> > >> you mention it works extremely well.
> > > 
> > > Sure... I've done a lot of benchmarking on this patchset ;-)
> > > I have a benchmark program called xdp_redirect_cpu [1][2], that collects
> > > stats via tracepoints (atm I'm limiting bulking to 8 packets, and have
> > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > > 
> > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> > > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > > 
> > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU
> > > that the 11.9 Mpps DDoS attack is hitting.
> > > 
> > > Running XDP/eBPF prog_num:4
> > > XDP-cpumap      CPU:to  pps          drop-pps    extra-info
> > > XDP-RX          0       12,030,471   11,966,982  0
> > > XDP-RX          total   12,030,471   11,966,982
> > > cpumap-enqueue    0:2   63,488       0           0
> > > cpumap-enqueue  sum:2   63,488       0           0
> > > cpumap_kthread  2       63,488       0           3 time_exceed
> > > cpumap_kthread  total   63,488       0           0
> > > redirect_err    total   0            0
> > > 
> > > $ netperf -H 172.16.0.2 -t TCP_CRR -l 10 -D1 -T5,5 -- -r 1024,1024
> > > Local /Remote
> > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > Send   Recv   Size     Size    Time     Rate
> > > bytes  Bytes  bytes    bytes   secs.    per sec
> > > 
> > > 16384  87380  1024     1024    10.00    12735.97
> > > 16384  87380
> > > 
> > > The netperf TCP_CRR performance is the same, without XDP loaded.
> > 
> > Just curious, could you also try this with RPS enabled (or does this have
> > RPS enabled)?  RPS should effectively do the same thing but higher in the
> > stack.  I'm curious what the delta would be.  Might be another interesting
> > case and fairly easy to set up if you already have the above scripts.
> 
> Yes, I'm essentially competing with RPS, thus such a comparison is very
> relevant...
> 
> This is only a 6 CPU system.  Allocate 2 CPUs to RPS receive and let
> the other 4 CPUs process packets.
> 
> Summary of RPS (Receive Packet Steering) performance:
>  * End result is 6.3 Mpps max performance
>  * netperf TCP_CRR is 1 trans/sec
>  * Each RX-RPS CPU stalls at ~3.2 Mpps
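
So if I'm reading the description right, the RX-CPU side of this test is
conceptually just the sketch below.  To be clear, this is my minimal
reconstruction, not the actual xdp_redirect_cpu_kern.c from [1]: the map
and function names are made up, the destination CPU is hard-coded to 2 to
match the table above, and it assumes plain IPv4 without IP options.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);	/* one slot per possible CPU */
	__type(key, __u32);		/* destination CPU index     */
	__type(value, __u32);		/* kthread queue size        */
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_ddos_and_redirect(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	struct udphdr *udph;
	__u32 dest_cpu = 2;		/* remote CPU, as in the table above */

	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	if (iph->protocol == IPPROTO_UDP) {
		/* Assumes no IP options, i.e. a 20-byte IPv4 header. */
		udph = (void *)(iph + 1);
		if ((void *)(udph + 1) > data_end)
			return XDP_PASS;
		/* The "DDoS filter": drop the pktgen flood on UDP port 9. */
		if (udph->dest == bpf_htons(9))
			return XDP_DROP;
	}

	/* Everything else goes through the cpumap; the SKB is built by
	 * the cpumap kthread on dest_cpu, not on this RX CPU. */
	return bpf_redirect_map(&cpu_map, dest_cpu, 0);
}

char _license[] SEC("license") = "GPL";

That is, the only verdicts taken on the RX CPU are drop or redirect; SKB
construction is deferred to the cpumap kthread on the target CPU, which
is what the cpumap_kthread rows in the table above are counting.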
> 
> The full test report below with setup:
> 
> The mask needed::
> 
>  perl -e 'printf "%b\n",0x3C'
>  111100
> 
> RPS setup::
> 
>  sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'
> 
>  for N in $(seq 0 5) ; do \
>    sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
>    sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
>    grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
>  done
> 
> Reduce RX queues to two::
> 
>  ethtool -L ixgbe1 combined 2
> 
> IRQ align to CPU numbers::
> 
>  $ ~/setup01.sh
>  Not root, running with sudo
>   --- Disable Ethernet flow-control ---
>  rx unmodified, ignoring
>  tx unmodified, ignoring
>  no pause parameters changed, aborting
>  rx unmodified, ignoring
>  tx unmodified, ignoring
>  no pause parameters changed, aborting
>   --- Align IRQs ---
>  /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
>  /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
>  /proc/irq/56/ixgbe1/../smp_affinity_list:0-5
> 
>  $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
>  /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
>  /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c
> 
> Generator is sending: 12,715,782 tx_packets /sec
> 
>  ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
>     -d 172.16.0.2 -t8
> 
> $ nstat > /dev/null && sleep 1 && nstat
> #kernel
> IpInReceives                    6346544            0.0
> IpInDelivers                    6346544            0.0
> IpOutRequests                   1020               0.0
> IcmpOutMsgs                     1020               0.0
> IcmpOutDestUnreachs             1020               0.0
> IcmpMsgOutType3                 1020               0.0
> UdpNoPorts                      6346898            0.0
> IpExtInOctets                   291964714          0.0
> IpExtOutOctets                  73440              0.0
> IpExtInNoECTPkts                6347063            0.0
> 
> $ mpstat -P ALL -u -I SCPU -I SUM
> 
> Average:  CPU    %usr   %nice    %sys    %irq   %soft   %idle
> Average:  all    0.00    0.00    0.00    0.42   72.97   26.61
> Average:    0    0.00    0.00    0.00    0.17   99.83    0.00
> Average:    1    0.00    0.00    0.00    0.17   99.83    0.00
> Average:    2    0.00    0.00    0.00    0.67   60.37   38.96
> Average:    3    0.00    0.00    0.00    0.67   58.70   40.64
> Average:    4    0.00    0.00    0.00    0.67   59.53   39.80
> Average:    5    0.00    0.00    0.00    0.67   58.93   40.40
> 
> Average:  CPU       intr/s
> Average:  all    152067.22
> Average:    0     50064.73
> Average:    1     50089.35
> Average:    2     45095.17
> Average:    3     44875.04
> Average:    4     44906.32
> Average:    5     45152.08
> 
> Average:  CPU   TIMER/s  NET_TX/s   NET_RX/s  TASKLET/s   SCHED/s     RCU/s
> Average:    0    609.48      0.17   49431.28       0.00      2.66     21.13
> Average:    1    567.55      0.00   49498.00       0.00      2.66     21.13
> Average:    2    998.34      0.00   43941.60       4.16     82.86     68.22
> Average:    3    540.60      0.17   44140.27       0.00     85.52    108.49
> Average:    4    537.27      0.00   44219.63       0.00     84.53     64.89
> Average:    5    530.78      0.17   44445.59       0.00     85.02     90.52
> 
> From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck.
> 
> Show adapter(s) (ixgbe1) statistics (ONLY that changed!)
> Ethtool(ixgbe1) stat:    11109531 (    11,109,531) <= fdir_miss /sec
> Ethtool(ixgbe1) stat:   380632356 (   380,632,356) <= rx_bytes /sec
> Ethtool(ixgbe1) stat:   812792611 (   812,792,611) <= rx_bytes_nic /sec
> Ethtool(ixgbe1) stat:     1753550 (     1,753,550) <= rx_missed_errors /sec
> Ethtool(ixgbe1) stat:     4602487 (     4,602,487) <= rx_no_dma_resources /sec
> Ethtool(ixgbe1) stat:     6343873 (     6,343,873) <= rx_packets /sec
> Ethtool(ixgbe1) stat:    10946441 (    10,946,441) <= rx_pkts_nic /sec
> Ethtool(ixgbe1) stat:   190287853 (   190,287,853) <= rx_queue_0_bytes /sec
> Ethtool(ixgbe1) stat:     3171464 (     3,171,464) <= rx_queue_0_packets /sec
> Ethtool(ixgbe1) stat:   190344503 (   190,344,503) <= rx_queue_1_bytes /sec
> Ethtool(ixgbe1) stat:     3172408 (     3,172,408) <= rx_queue_1_packets /sec
> 
> Notice, each RX-CPU can only process 3.1 Mpps.
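
As a side-by-side with the rps_cpus / rps_flow_cnt knobs above: the
cpumap equivalent, as far as I understand the API, is just populating
the map from user space (key = destination CPU, value = kthread queue
size) and attaching the XDP program.  Roughly like the sketch below --
this is written against today's libbpf, with made-up object and map
names, so it's an illustration of the idea rather than the actual
xdp_redirect_cpu_user.c from [2].

#include <stdio.h>
#include <net/if.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int map_fd, ifindex;
	__u32 cpu = 2;		/* remote CPU that will build the SKBs     */
	__u32 qsize = 192;	/* per-CPU queue size (illustrative value) */

	ifindex = if_nametoindex("ixgbe1");
	if (!ifindex)
		return 1;

	obj = bpf_object__open_file("xdp_cpumap_sketch.o", NULL);
	if (!obj || bpf_object__load(obj))
		return 1;

	/* Enable the destination CPU: only CPUs with an entry (non-zero
	 * queue size) are valid redirect targets. */
	map_fd = bpf_object__find_map_fd_by_name(obj, "cpu_map");
	if (map_fd < 0 || bpf_map_update_elem(map_fd, &cpu, &qsize, 0))
		return 1;

	/* Attach the XDP program sketched earlier to the RX device. */
	prog = bpf_object__find_program_by_name(obj, "xdp_ddos_and_redirect");
	if (!prog || bpf_xdp_attach(ifindex, bpf_program__fd(prog), 0, NULL))
		return 1;

	printf("cpumap redirect to CPU %u enabled on ixgbe1\n", cpu);
	return 0;
}

In other words, the cpumap analogue of choosing the 0x3c RPS mask is
simply which CPU keys get inserted into the map.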
> 
> RPS RX-CPU(0):
> 
> # Overhead  CPU  Symbol
> # ........  ...  .......................................
> #
>     11.72%  000  [k] ixgbe_poll
>     11.29%  000  [k] _raw_spin_lock
>     10.35%  000  [k] dev_gro_receive
>      8.36%  000  [k] __build_skb
>      7.35%  000  [k] __skb_get_hash
>      6.22%  000  [k] enqueue_to_backlog
>      5.89%  000  [k] __skb_flow_dissect
>      4.43%  000  [k] inet_gro_receive
>      4.19%  000  [k] ___slab_alloc
>      3.90%  000  [k] queued_spin_lock_slowpath
>      3.85%  000  [k] kmem_cache_alloc
>      3.06%  000  [k] build_skb
>      2.66%  000  [k] get_rps_cpu
>      2.57%  000  [k] napi_gro_receive
>      2.34%  000  [k] eth_type_trans
>      1.81%  000  [k] __cmpxchg_double_slab.isra.61
>      1.47%  000  [k] ixgbe_alloc_rx_buffers
>      1.43%  000  [k] get_partial_node.isra.81
>      0.84%  000  [k] swiotlb_sync_single
>      0.74%  000  [k] udp4_gro_receive
>      0.73%  000  [k] netif_receive_skb_internal
>      0.72%  000  [k] udp_gro_receive
>      0.63%  000  [k] skb_gro_reset_offset
>      0.49%  000  [k] __skb_flow_get_ports
>      0.48%  000  [k] llist_add_batch
>      0.36%  000  [k] swiotlb_sync_single_for_cpu
>      0.34%  000  [k] __slab_alloc
> 
> Remote RPS-CPU(3) getting packets::
> 
> # Overhead  CPU  Symbol
> # ........  ...  ..............................................
> #
>     33.02%  003  [k] poll_idle
>     10.99%  003  [k] __netif_receive_skb_core
>     10.45%  003  [k] page_frag_free
>      8.49%  003  [k] ip_rcv
>      4.19%  003  [k] fib_table_lookup
>      2.84%  003  [k] __udp4_lib_rcv
>      2.81%  003  [k] __slab_free
>      2.23%  003  [k] __udp4_lib_lookup
>      2.09%  003  [k] ip_route_input_rcu
>      2.07%  003  [k] kmem_cache_free
>      2.06%  003  [k] udp_v4_early_demux
>      1.73%  003  [k] ip_rcv_finish
Very interesting data.  So the above perf report compares to this one
from xdp-redirect-cpu:

  Perf top on a CPU(3) that has to alloc and free SKBs etc.

  # Overhead  CPU  Symbol
  # ........  ...  .......................................
  #
      15.51%  003  [k] fib_table_lookup
       8.91%  003  [k] cpu_map_kthread_run
       8.04%  003  [k] build_skb
       7.88%  003  [k] page_frag_free
       5.13%  003  [k] kmem_cache_alloc
       4.76%  003  [k] ip_route_input_rcu
       4.59%  003  [k] kmem_cache_free
       4.02%  003  [k] __udp4_lib_rcv
       3.20%  003  [k] fib_validate_source
       3.02%  003  [k] __netif_receive_skb_core
       3.02%  003  [k] udp_v4_early_demux
       2.90%  003  [k] ip_rcv
       2.80%  003  [k] ip_rcv_finish

Right?  And in the RPS case the consumer CPU is 33% idle, whereas with
redirect-cpu you can load it up all the way.

Am I interpreting all this correctly that with RPS, cpu0 cannot
distribute the packets to the other CPUs fast enough and that's the
bottleneck, whereas with redirect-cpu you're doing early packet
distribution before the SKB alloc?  So in other words, with redirect-cpu
all the consumer CPUs are doing SKB alloc, while with RPS cpu0 is
allocating SKBs for all of them, and that's where the 6M->12M
performance gain comes from?
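
In other words, I'm picturing the consumer side roughly like the purely
conceptual sketch below.  This is not your actual patchset code --
"struct xdp_raw_frame" and the bare ptr_ring are hypothetical stand-ins
for whatever queue and frame metadata cpumap really carries -- the point
is only *where* the SKB gets built: on the remote CPU running the
kthread, never on the XDP RX CPU.

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/ptr_ring.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>

struct xdp_raw_frame {			/* hypothetical frame descriptor */
	void *data_hard_start;		/* start of the RX page/buffer   */
	unsigned int headroom;		/* offset to packet data         */
	unsigned int len;		/* packet length                 */
	unsigned int frame_sz;		/* total buffer size             */
	struct net_device *dev_rx;	/* device the frame arrived on   */
};

static int cpumap_kthread_sketch(void *arg)
{
	struct ptr_ring *queue = arg;	/* filled in bulk at map-flush time */

	while (!kthread_should_stop()) {
		struct xdp_raw_frame *frm;
		struct sk_buff *skb;

		frm = __ptr_ring_consume(queue);
		if (!frm) {
			/* Queue empty: sleep until the RX CPU enqueues more. */
			set_current_state(TASK_INTERRUPTIBLE);
			schedule();
			__set_current_state(TASK_RUNNING);
			continue;
		}

		/* The SKB allocation happens here, on the remote CPU. */
		skb = build_skb(frm->data_hard_start, frm->frame_sz);
		if (!skb)
			continue;
		skb_reserve(skb, frm->headroom);
		skb_put(skb, frm->len);
		skb->protocol = eth_type_trans(skb, frm->dev_rx);

		/* Feed the normal stack from this CPU. */
		local_bh_disable();
		netif_receive_skb(skb);
		local_bh_enable();
	}
	return 0;
}

If that mental model is right, it also explains the CPU 0 profiles: with
RPS, __build_skb, dev_gro_receive and __skb_flow_dissect all run on the
RX CPU before enqueue_to_backlog, while with cpumap only the raw frame
descriptor crosses CPUs, so none of that work lands on the RX core.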