On Fri, 23 Mar 2018 10:29:29 -0700 Alexander Duyck <alexander.du...@gmail.com> wrote:
> On Fri, Mar 23, 2018 at 5:19 AM, Jesper Dangaard Brouer
> <bro...@redhat.com> wrote:
> > Changing the API xdp_return_frame() to take struct xdp_frame as
> > argument seems like a natural choice. But there are some subtle
> > performance details here that need extra care, and the design below
> > is a deliberate choice.
> >
> > Dereferencing the xdp_frame on a remote CPU during DMA-TX completion
> > results in its cache-line changing to "Shared" state. Later, when
> > the page is reused for RX, this xdp_frame cache-line is written,
> > which changes the state to "Modified".
> >
> > This situation already happens (naturally) for virtio_net, tun and
> > cpumap, as the xdp_frame pointer is the queued object. In tun and
> > cpumap, the ptr_ring is used for efficiently transferring
> > cache-lines (with pointers) between CPUs. Thus, the only option is
> > to dereference the xdp_frame.
> >
> > Only the ixgbe driver had an optimization that avoids dereferencing
> > the xdp_frame. The driver already has a TX-ring queue, which (in
> > case of remote DMA-TX completion) has to be transferred between CPUs
> > anyhow. In this data area, we stored a struct xdp_mem_info and a
> > data pointer, which allowed us to avoid dereferencing the xdp_frame.
> >
> > To compensate for this, a prefetchw is used to tell the cache
> > coherency protocol about our access pattern. My benchmarks show that
> > this prefetchw is enough to compensate in the ixgbe driver.
> >
> > Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
>
> I'm really not a fan of this patch. It seems like it is adding a ton
> of overhead for one case that is going to penalize all of the other
> use cases for XDP. Just the extra prefetch is going to have a
> significant impact on things like the XDP_DROP case.
>
> [...]
>
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index ff069597fccf..e6e9b28ecfba 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
> >
> >                 /* free the skb */
> >                 if (ring_is_xdp(tx_ring))
> > -                       xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
> > +                       xdp_return_frame(tx_buffer->xdpf);
> >                 else
> >                         napi_consume_skb(tx_buffer->skb, napi_budget);
> >
> > @@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
> >                         xdp.data_hard_start = xdp.data -
> >                                               ixgbe_rx_offset(rx_ring);
> >                         xdp.data_end = xdp.data + size;
> > +                       prefetchw(xdp.data_hard_start); /* xdp_frame write */
> >
> >                         skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
> >                 }
>
> Do we really need to be prefetching yet another cache line? This is
> going to hurt general performance since for most cases you are now
> prefetching a cache line that will never be used.

My intuition was also that this would hurt performance, but I've done
many tests, and they all show no performance regression from this
change!

XDP_DROP testing with xdp1 on ixgbe:

 Baseline: 14,703,344 pkt/s
 Patched:  14,711,574 pkt/s

For people reproducing this, notice that it requires tuning on the
generator side (with pktgen) to reach these extreme speeds.

As you might have noticed, the prefetchw is only active when an XDP
program is loaded.
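
To make it concrete what that prefetchw is warming up: this series
converts the xdp_buff into an xdp_frame by writing the frame meta-data
into the top of the packet headroom, i.e. exactly the cache-line at
xdp.data_hard_start. A simplified sketch of the conversion helper (the
real one lives in include/net/xdp.h in this series; metasize handling
and some error checks are trimmed here):

  struct xdp_frame {
          void *data;
          u16 len;
          u16 headroom;
          u16 metasize;
          /* Memory-model info needed by xdp_return_frame() when the
           * frame is freed, possibly on a remote CPU.
           */
          struct xdp_mem_info mem;
  };

  static inline
  struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
  {
          struct xdp_frame *xdpf;
          int headroom = xdp->data - xdp->data_hard_start;

          /* Assure headroom is large enough for the xdp_frame itself */
          if (unlikely(headroom < sizeof(*xdpf)))
                  return NULL;

          /* Store frame info in the top of the packet headroom;
           * this is the write the prefetchw() prepares for.
           */
          xdpf = xdp->data_hard_start;
          xdpf->data = xdp->data;
          xdpf->len  = xdp->data_end - xdp->data;
          xdpf->headroom = headroom - sizeof(*xdpf);
          xdpf->mem  = xdp->rxq->mem;

          return xdpf;
  }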
To measure this, I created a benchmark[1] that loads an XDP program
which always returns XDP_PASS (selected via --action).

First, I drop the packets in the iptables raw table:

 Baseline: 5,783,509 pps
 Patched:  5,789,195 pps

Then I unload the netfilter modules and let packets reach the UDP
"UdpNoPorts" stage (no socket listening on the port):

 Baseline: 3,832,956 pps
 Patched:  3,855,470 pps

Then I add a udp_sink (--recvmsg) to allow packets to travel deeper
into the network stack (and force udp_sink to run on another CPU):

 Baseline: 2,278,855 pps
 Patched:  2,270,373 pps

These measurements should make it clear that this patchset does not
add "a ton of overhead", nor does it "penalize all of the other use
cases for XDP".

For the XDP_REDIRECT use-case, I actually find the prefetchw() quite
beautiful: the xdp_buff (a stack variable, known to be cache-hot) is
used by the bpf_prog, which gives the prefetch time to fetch the
memory area that the xdp_buff-to-xdp_frame conversion writes into.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_bench01_mem_access_cost_kern.c
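
For reference, the core of such a benchmark program looks roughly like
the minimal sketch below (samples/bpf style of this kernel era; a
simplified stand-in, not the exact code in [1] -- the map-based action
lookup is what the --action option sets from userspace):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  /* Single-entry array map holding the XDP action to return,
   * written by the userspace loader (--action XDP_PASS etc.)
   */
  struct bpf_map_def SEC("maps") xdp_action = {
          .type        = BPF_MAP_TYPE_ARRAY,
          .key_size    = sizeof(__u32),
          .value_size  = sizeof(__u32),
          .max_entries = 1,
  };

  SEC("xdp_bench01")
  int xdp_bench01_prog(struct xdp_md *ctx)
  {
          __u32 key = 0;
          __u32 *action;

          action = bpf_map_lookup_elem(&xdp_action, &key);
          if (!action)
                  return XDP_ABORTED;

          /* Touching no packet data measures the bare cost of
           * running an XDP program for the chosen return action.
           */
          return *action;
  }

  char _license[] SEC("license") = "GPL";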