On 17-01-03 08:54 AM, John Fastabend wrote:
> On 17-01-02 10:01 PM, Jason Wang wrote:
>>
>>
>> On 2017-01-03 03:44, John Fastabend wrote:
>>> Add support for XDP adjust head by allocating a 256B header region
>>> that XDP programs can grow into. This is only enabled when an XDP
>>> program is loaded.
>>>
>>> In order to ensure that we do not have to unwind queue headroom,
>>> push queue setup below bpf_prog_add. It reads better to do a prog
>>> ref unwind vs another queue setup call.
>>>
>>> There is a problem with this patch as-is. When an xdp prog is loaded,
>>> the old buffers without the 256B headroom need to be flushed so that
>>> the bpf prog has the necessary headroom. This patch does this by
>>> calling virtqueue_detach_unused_buf(), followed by the
>>> virtnet_set_queues() call to reinitialize the buffers. However, I
>>> don't believe this is safe per the comment in virtio_ring: this API
>>> is not valid on an active queue, and the only thing we have done
>>> here is the napi_disable/napi_enable wrappers, which don't do
>>> anything to the emulation layer.
>>>
>>> So the RFC is really to find the best solution to this problem.
>>> A couple of things come to mind: (a) always allocate the necessary
>>> headroom, but this is a bit of a waste; (b) add a bit somewhere to
>>> check whether the buffer has headroom, but this would mean XDP
>>> programs would be broken for a cycle through the ring; (c) figure
>>> out how to deactivate a queue, free the buffers, and finally
>>> reallocate. I think (c) is the best choice for now, but I'm not
>>> seeing the API to do this, so virtio/qemu experts, does anyone know
>>> off-hand how to make this work? I started looking into the PCI
>>> callbacks reset() and virtio_device_ready(), or possibly hitting
>>> the right set of bits with vp_set_status(), but my first attempt
>>> just hung the device.
>>
>> Hi John:
>>
>> AFAIK, disabling a specific queue is supported only by virtio 1.0,
>> through the queue_enable field in the PCI common cfg. But
>> unfortunately, qemu does not emulate this at all, and legacy devices
>> do not even support it. So the safe way is probably to reset the
>> device and redo the initialization here.
>>
>
> OK, I'll draft up a fix with a full reset unless Michael has some idea in
> the meantime.
>
>>>
>>> Signed-off-by: John Fastabend <john.r.fastab...@intel.com>
>>> ---
>>>  drivers/net/virtio_net.c | 106 +++++++++++++++++++++++++++++++++++-----------
>>>  1 file changed, 80 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 5deeda6..fcc5bd7 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -159,6 +159,9 @@ struct virtnet_info {
>>>  	/* Ethtool settings */
>>>  	u8 duplex;
>>>  	u32 speed;
>>> +
>>> +	/* Headroom allocated in RX Queue */
>>> +	unsigned int headroom;
>>>  };
>>>  struct padded_vnet_hdr {
>>> @@ -355,6 +358,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
>>>  	}
>>>  	if (vi->mergeable_rx_bufs) {
>>> +		xdp->data -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
>>>  		/* Zero header and leave csum up to XDP layers */
>>>  		hdr = xdp->data;
>>>  		memset(hdr, 0, vi->hdr_len);
>>> @@ -371,7 +375,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
>>>  		num_sg = 2;
>>>  		sg_init_table(sq->sg, 2);
>>>  		sg_set_buf(sq->sg, hdr, vi->hdr_len);
>>> -		skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
>>> +		skb_to_sgvec(skb, sq->sg + 1, vi->headroom, xdp->data_end - xdp->data);
>>
>> vi->headroom looks suspicious, should it be xdp->data - xdp->data_hard_start?
>>
>
> Yep, found this as well while testing small packet receive.
>
>>>  	}
>>>  	err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
>>>  				   data, GFP_ATOMIC);
>>> @@ -393,34 +397,39 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
>>>  				       struct bpf_prog *xdp_prog,
>>>  				       void *data, int len)
>>>  {
>>> -	int hdr_padded_len;
>>>  	struct xdp_buff xdp;
>>> -	void *buf;
>>>  	unsigned int qp;
>>>  	u32 act;
>>> +
>>>  	if (vi->mergeable_rx_bufs) {
>>> -		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>>> -		xdp.data = data + hdr_padded_len;
>>> +		int desc_room = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>>> +
>>> +		/* Allow consuming headroom but reserve enough space to push
>>> +		 * the descriptor on if we get an XDP_TX return code.
>>> +		 */
>>> +		xdp.data_hard_start = data - vi->headroom + desc_room;
>>> +		xdp.data = data + desc_room;
>>>  		xdp.data_end = xdp.data + (len - vi->hdr_len);
>>> -		buf = data;
>>>  	} else { /* small buffers */
>>>  		struct sk_buff *skb = data;
>>> -		xdp.data = skb->data;
>>> +		xdp.data_hard_start = skb->data;
>>> +		xdp.data = skb->data + vi->headroom;
>>>  		xdp.data_end = xdp.data + len;
>>> -		buf = skb->data;
>>>  	}
>>>  	act = bpf_prog_run_xdp(xdp_prog, &xdp);
>>>  	switch (act) {
>>>  	case XDP_PASS:
>>> +		if (!vi->mergeable_rx_bufs)
>>> +			__skb_pull((struct sk_buff *) data,
>>> +				   xdp.data - xdp.data_hard_start);
>>
>> Instead of doing things here and in virtnet_xdp_xmit(), how about always
>> making skb->data point to the buffer head, like:
>>
>> 1) reserve headroom in add_recvbuf_small()
>> 2) skb_push(xdp->data - xdp->data_hard_start, skb) if we detect xdp->data
>>    was modified after bpf_prog_run_xdp()
>>
>> Then there's no special code in either XDP_PASS or XDP_TX?
>>
>
> Sure. Works for me and avoids the if/else code.
>
>>>  		return XDP_PASS;
>>>  	case XDP_TX:
>>>  		qp = vi->curr_queue_pairs -
>>>  		     vi->xdp_queue_pairs +
>>>  		     smp_processor_id();
>>
>> [...]
>>
>>> +#define VIRTIO_XDP_HEADROOM 256
>>> +
>>>  static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>>>  {
>>>  	unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
>>>  	struct virtnet_info *vi = netdev_priv(dev);
>>>  	struct bpf_prog *old_prog;
>>>  	u16 xdp_qp = 0, curr_qp;
>>> +	unsigned int old_hr;
>>>  	int i, err;
>>>  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>>> @@ -1736,19 +1751,58 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>>>  		return -ENOMEM;
>>>  	}
>>> +	old_hr = vi->headroom;
>>> +	if (prog) {
>>> +		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
>>> +		if (IS_ERR(prog))
>>> +			return PTR_ERR(prog);
>>> +		vi->headroom = VIRTIO_XDP_HEADROOM;
>>> +	} else {
>>> +		vi->headroom = 0;
>>> +	}
>>> +
>>> +	/* Changing the headroom in buffers is a disruptive operation because
>>> +	 * existing buffers must be flushed and reallocated. This will happen
>>> +	 * when a xdp program is initially added or xdp is disabled by removing
>>> +	 * the xdp program.
>>> +	 */
>>
>> We probably need to reset the device here, but maybe Michael has more
>> ideas. And if we do this, another interesting thing to do is to disable
>> EWMA and always use a single page for each packet; this could almost
>> eliminate linearizing.
>
> Well, with a normal 1500 MTU we should not hit the linearizing case, right?
> The question is whether we should cap the MTU at GOOD_PACKET_LEN vs the
> current cap of (PAGE_SIZE - overhead).
Sorry, responding to my own post with a bit more detail. I don't really like
going to a page for each packet because we end up with double the pages in
use for the "normal" 1500 MTU case. We could make the XDP allocation scheme
smarter and allocate a page per packet when the MTU is greater than 2k,
instead of using the EWMA estimate (a rough sketch of that check is at the
end of this mail), but I would push those kinds of changes at net-next and
live with the linearizing behavior for now, or cap the MTU.

>
>>
>> Thanks
>
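
As a point of reference for suggestion (1)/(2) above, here is a minimal
sketch of what the small-buffer path could look like. This is not the patch
itself: virtnet_run_xdp_small() is a made-up helper, and it assumes
add_recvbuf_small() has already been changed to skb_reserve() the XDP
headroom in front of the packet.

#include <linux/filter.h>	/* struct xdp_buff, bpf_prog_run_xdp() */
#include <linux/skbuff.h>

#define VIRTIO_XDP_HEADROOM 256	/* same value the patch defines */

/* Sketch only: run the XDP program on a small-buffer skb whose headroom
 * was reserved at receive-buffer allocation time, then make the skb
 * track whatever the program did to xdp.data.
 */
static u32 virtnet_run_xdp_small(struct bpf_prog *xdp_prog,
				 struct sk_buff *skb)
{
	void *orig_data = skb->data;
	struct xdp_buff xdp;
	u32 act;

	xdp.data_hard_start = skb->data - VIRTIO_XDP_HEADROOM;
	xdp.data = skb->data;
	xdp.data_end = skb->data + skb->len;

	act = bpf_prog_run_xdp(xdp_prog, &xdp);

	/* bpf_xdp_adjust_head() may move xdp.data in either direction;
	 * adjusting skb->data/len here means neither XDP_PASS nor XDP_TX
	 * needs its own pointer fix-ups afterwards.
	 */
	if (xdp.data > orig_data)
		__skb_pull(skb, xdp.data - orig_data);
	else if (xdp.data < orig_data)
		skb_push(skb, orig_data - xdp.data);

	return act;
}

The only per-verdict work left after that would be the vnet header push in
the XDP_TX transmit path.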
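
And for the 2k-vs-page idea in the last paragraph, roughly the shape of the
check meant there. Again just a sketch: xdp_mergeable_buf_len() is a made-up
helper that would live in virtio_net.c, and the 2048 cutoff is only
illustrative.

/* Sketch only: choose a receive buffer length while an XDP program is
 * attached, instead of the EWMA estimate, so the XDP path never has to
 * linearize.  The real code would also account for alignment padding.
 */
static unsigned int xdp_mergeable_buf_len(struct virtnet_info *vi)
{
	unsigned int needed = vi->dev->mtu + ETH_HLEN + VLAN_HLEN +
			      vi->hdr_len + VIRTIO_XDP_HEADROOM;

	/* A fixed 2K buffer covers the "normal" 1500 MTU case without
	 * burning a page per packet; larger MTUs get a full page.
	 */
	return needed <= 2048 ? 2048 : PAGE_SIZE;
}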