On Fri, Mar 13, 2026 at 5:27 PM Vishwanath Seshagiri <[email protected]> wrote:
>
> On 3/13/26 1:21 PM, Jason Wang wrote:
> > On Wed, Mar 11, 2026 at 2:31 AM Vishwanath Seshagiri <[email protected]> wrote:
> >>
> >> Use page_pool for RX buffer allocation in mergeable and small buffer
> >> modes to enable page recycling and avoid repeated page allocator calls.
> >> skb_mark_for_recycle() enables page reuse in the network stack.
> >>
> >> Big packets mode is unchanged because it uses page->private for linked
> >> list chaining of multiple pages per buffer, which conflicts with
> >> page_pool's internal use of page->private.
> >>
> >> Implement conditional DMA premapping using virtqueue_dma_dev():
> >> - When non-NULL (vhost, virtio-pci): use PP_FLAG_DMA_MAP with page_pool
> >>   handling DMA mapping, submit via virtqueue_add_inbuf_premapped()
> >> - When NULL (VDUSE, direct physical): page_pool handles allocation only,
> >>   submit via virtqueue_add_inbuf_ctx()
> >>
> >> This preserves the DMA premapping optimization from commit 31f3cd4e5756b
> >> ("virtio-net: rq submits premapped per-buffer") while adding page_pool
> >> support as a prerequisite for future zero-copy features (devmem TCP,
> >> io_uring ZCRX).
> >>
> >> Page pools are created in probe and destroyed in remove (not open/close),
> >> following existing driver behavior where RX buffers remain in virtqueues
> >> across interface state changes.
> >>
> >> Signed-off-by: Vishwanath Seshagiri <[email protected]>
> >> ---
> >> Changes in v11:
> >> - add_recvbuf_small: encode alloc_len and xdp_headroom in ctx via
> >>   mergeable_len_to_ctx() so receive_small() recovers the actual buflen
> >>   via mergeable_ctx_to_truesize() (Michael S. Tsirkin)
> >> - receive_small_build_skb, receive_small_xdp: accept buflen parameter
> >>   instead of recomputing it, to use the actual allocation size
> >> - v10:
> >>
> >> https://lore.kernel.org/virtualization/[email protected]/
> >>
> >> Changes in v10:
> >> - add_recvbuf_small: use alloc_len to avoid clobbering len; v9 feedback
> >>   was about truesize under-accounting, not variable naming (misunderstood
> >>   the comment in v9)
> >> - v9:
> >>
> >> https://lore.kernel.org/virtualization/[email protected]/
> >>
> >> Changes in v9:
> >> - Fix virtnet_skb_append_frag() for XSK callers (Michael S. Tsirkin)
> >> - v8:
> >>
> >> https://lore.kernel.org/virtualization/[email protected]/
> >>
> >> Changes in v8:
> >> - Remove virtnet_no_page_pool() helper, replace with direct !rq->page_pool
> >>   checks or inlined conditions (Xuan Zhuo)
> >> - Extract virtnet_rq_submit() helper to consolidate DMA/non-DMA buffer
> >>   submission in add_recvbuf_small() and add_recvbuf_mergeable()
> >> - Add skb_mark_for_recycle(nskb) for overflow frag_list skbs in
> >>   virtnet_skb_append_frag() to ensure page_pool pages are returned to
> >>   the pool instead of freed via put_page()
> >> - Rebase on net-next (kzalloc_objs API)
> >> - v7:
> >>
> >> https://lore.kernel.org/virtualization/[email protected]/
> >>
> >> Changes in v7:
> >> - Replace virtnet_put_page() helper with direct page_pool_put_page()
> >>   calls (Xuan Zhuo)
> >> - Add virtnet_no_page_pool() helper to consolidate big_packets mode check
> >>   (Michael S. Tsirkin)
> >> - Add DMA sync_for_cpu for subsequent buffers in xdp_linearize_page() when
> >>   use_page_pool_dma is set (Michael S. Tsirkin)
> >> - Remove unused pp_params.dev assignment in non-DMA path
> >> - Add page pool recreation in virtnet_restore_up() for freeze/restore
> >>   support (Chris Mason's Review Prompt)
> >> - v6:
> >>
> >> https://lore.kernel.org/virtualization/[email protected]/
> >>
> >> Changes in v6:
> >> - Drop page_pool_frag_offset_add() helper and switch to
> >>   page_pool_alloc_va(); page_pool_alloc_netmem() already handles
> >>   internal fragmentation internally (Jakub Kicinski)
> >> - v5:
> >>
> >> https://lore.kernel.org/virtualization/[email protected]/
> >>
> >> Benchmark results:
> >>
> >> Configuration: pktgen TX -> tap -> vhost-net | virtio-net RX -> XDP_DROP
> >>
> >> Small packets (64 bytes, mrg_rxbuf=off):
> >>   1Q: 853,493 -> 868,923 pps (+1.8%)
> >>   2Q: 1,655,793 -> 1,696,707 pps (+2.5%)
> >>   4Q: 3,143,375 -> 3,302,511 pps (+5.1%)
> >>   8Q: 6,082,590 -> 6,156,894 pps (+1.2%)
> >>
> >> Mergeable RX (64 bytes):
> >>   1Q: 766,168 -> 814,493 pps (+6.3%)
> >>   2Q: 1,384,871 -> 1,670,639 pps (+20.6%)
> >>   4Q: 2,773,081 -> 3,080,574 pps (+11.1%)
> >>   8Q: 5,600,615 -> 6,043,891 pps (+7.9%)
> >>
> >> Mergeable RX (1500 bytes):
> >>   1Q: 741,579 -> 785,442 pps (+5.9%)
> >>   2Q: 1,310,043 -> 1,534,554 pps (+17.1%)
> >>   4Q: 2,748,700 -> 2,890,582 pps (+5.2%)
> >>   8Q: 5,348,589 -> 5,618,664 pps (+5.0%)
> >>
> >>  drivers/net/Kconfig      |   1 +
> >>  drivers/net/virtio_net.c | 497 ++++++++++++++++++++-------------------------
> >>  2 files changed, 251 insertions(+), 247 deletions(-)
> >>
> >> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> >> index 17108c359216..b2fd90466bab 100644
> >> --- a/drivers/net/Kconfig
> >> +++ b/drivers/net/Kconfig
> >> @@ -452,6 +452,7 @@ config VIRTIO_NET
> >>  	depends on VIRTIO
> >>  	select NET_FAILOVER
> >>  	select DIMLIB
> >> +	select PAGE_POOL
> >>  	help
> >>  	  This is the virtual network driver for virtio. It can be used with
> >>  	  QEMU based VMMs (like KVM or Xen). Say Y or M.
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 72d6a9c6a5a2..a85d75a7f539 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -26,6 +26,7 @@
> >>  #include <net/netdev_rx_queue.h>
> >>  #include <net/netdev_queues.h>
> >>  #include <net/xdp_sock_drv.h>
> >> +#include <net/page_pool/helpers.h>
> >>
> >>  static int napi_weight = NAPI_POLL_WEIGHT;
> >>  module_param(napi_weight, int, 0444);
> >> @@ -290,14 +291,6 @@ struct virtnet_interrupt_coalesce {
> >>  	u32 max_usecs;
> >>  };
> >>
> >> -/* The dma information of pages allocated at a time. */
> >> -struct virtnet_rq_dma {
> >> -	dma_addr_t addr;
> >> -	u32 ref;
> >> -	u16 len;
> >> -	u16 need_sync;
> >> -};
> >> -
> >>  /* Internal representation of a send virtqueue */
> >>  struct send_queue {
> >>  	/* Virtqueue associated with this send _queue */
> >> @@ -356,8 +349,10 @@ struct receive_queue {
> >>  	/* Average packet length for mergeable receive buffers. */
> >>  	struct ewma_pkt_len mrg_avg_pkt_len;
> >>
> >> -	/* Page frag for packet buffer allocation. */
> >> -	struct page_frag alloc_frag;
> >> +	struct page_pool *page_pool;
> >> +
> >> +	/* True if page_pool handles DMA mapping via PP_FLAG_DMA_MAP */
> >> +	bool use_page_pool_dma;
> >>
> >>  	/* RX: fragments + linear part + virtio header */
> >>  	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> >> @@ -370,9 +365,6 @@ struct receive_queue {
> >>
> >>  	struct xdp_rxq_info xdp_rxq;
> >>
> >> -	/* Record the last dma info to free after new pages is allocated. */
> >> -	struct virtnet_rq_dma *last_dma;
> >> -
> >>  	struct xsk_buff_pool *xsk_pool;
> >>
> >>  	/* xdp rxq used by xsk */
> >> @@ -521,11 +513,14 @@ static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
> >>  			       struct virtnet_rq_stats *stats);
> >>  static void virtnet_receive_done(struct virtnet_info *vi, struct receive_queue *rq,
> >>  				 struct sk_buff *skb, u8 flags);
> >> -static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
> >> +static struct sk_buff *virtnet_skb_append_frag(struct receive_queue *rq,
> >> +					       struct sk_buff *head_skb,
> >>  					       struct sk_buff *curr_skb,
> >>  					       struct page *page, void *buf,
> >>  					       int len, int truesize);
> >>  static void virtnet_xsk_completed(struct send_queue *sq, int num);
> >> +static void free_unused_bufs(struct virtnet_info *vi);
> >> +static void virtnet_del_vqs(struct virtnet_info *vi);
> >>
> >>  enum virtnet_xmit_type {
> >>  	VIRTNET_XMIT_TYPE_SKB,
> >> @@ -709,12 +704,10 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
> >>  static void virtnet_rq_free_buf(struct virtnet_info *vi,
> >>  				struct receive_queue *rq, void *buf)
> >>  {
> >> -	if (vi->mergeable_rx_bufs)
> >> -		put_page(virt_to_head_page(buf));
> >> -	else if (vi->big_packets)
> >> +	if (!rq->page_pool)
> >>  		give_pages(rq, buf);
> >>  	else
> >> -		put_page(virt_to_head_page(buf));
> >> +		page_pool_put_page(rq->page_pool, virt_to_head_page(buf), -1, false);
> >>  }
> >>
> >>  static void enable_rx_mode_work(struct virtnet_info *vi)
> >> @@ -876,10 +869,16 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>  		skb = virtnet_build_skb(buf, truesize, p - buf, len);
> >>  		if (unlikely(!skb))
> >>  			return NULL;
> >> +		/* Big packets mode chains pages via page->private, which is
> >> +		 * incompatible with the way page_pool uses page->private.
> >> +		 * Currently, big packets mode doesn't use page pools.
> >> +		 */
> >> +		if (!rq->page_pool) {
> >> +			page = (struct page *)page->private;
> >> +			if (page)
> >> +				give_pages(rq, page);
> >> +		}
> >>
> >> -		page = (struct page *)page->private;
> >> -		if (page)
> >> -			give_pages(rq, page);
> >>  		goto ok;
> >>  	}
> >>
> >> @@ -925,133 +924,16 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>  	hdr = skb_vnet_common_hdr(skb);
> >>  	memcpy(hdr, hdr_p, hdr_len);
> >>  	if (page_to_free)
> >> -		put_page(page_to_free);
> >> +		page_pool_put_page(rq->page_pool, page_to_free, -1, true);
> >>
> >>  	return skb;
> >>  }
> >>
> >> -static void virtnet_rq_unmap(struct receive_queue *rq, void *buf, u32 len)
> >> -{
> >> -	struct virtnet_info *vi = rq->vq->vdev->priv;
> >> -	struct page *page = virt_to_head_page(buf);
> >> -	struct virtnet_rq_dma *dma;
> >> -	void *head;
> >> -	int offset;
> >> -
> >> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
> >> -
> >> -	head = page_address(page);
> >> -
> >> -	dma = head;
> >> -
> >> -	--dma->ref;
> >> -
> >> -	if (dma->need_sync && len) {
> >> -		offset = buf - (head + sizeof(*dma));
> >> -
> >> -		virtqueue_map_sync_single_range_for_cpu(rq->vq, dma->addr,
> >> -							offset, len,
> >> -							DMA_FROM_DEVICE);
> >> -	}
> >> -
> >> -	if (dma->ref)
> >> -		return;
> >> -
> >> -	virtqueue_unmap_single_attrs(rq->vq, dma->addr, dma->len,
> >> -				     DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
> >> -	put_page(page);
> >> -}
> >> -
> >>  static void *virtnet_rq_get_buf(struct receive_queue *rq, u32 *len, void **ctx)
> >>  {
> >> -	struct virtnet_info *vi = rq->vq->vdev->priv;
> >> -	void *buf;
> >> -
> >> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
> >> +	BUG_ON(!rq->page_pool);
> >>
> >> -	buf = virtqueue_get_buf_ctx(rq->vq, len, ctx);
> >> -	if (buf)
> >> -		virtnet_rq_unmap(rq, buf, *len);
> >> -
> >> -	return buf;
> >> -}
> >> -
> >> -static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
> >> -{
> >> -	struct virtnet_info *vi = rq->vq->vdev->priv;
> >> -	struct virtnet_rq_dma *dma;
> >> -	dma_addr_t addr;
> >> -	u32 offset;
> >> -	void *head;
> >> -
> >> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
> >> -
> >> -	head = page_address(rq->alloc_frag.page);
> >> -
> >> -	offset = buf - head;
> >> -
> >> -	dma = head;
> >> -
> >> -	addr = dma->addr - sizeof(*dma) + offset;
> >> -
> >> -	sg_init_table(rq->sg, 1);
> >> -	sg_fill_dma(rq->sg, addr, len);
> >> -}
> >> -
> >> -static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
> >> -{
> >> -	struct page_frag *alloc_frag = &rq->alloc_frag;
> >> -	struct virtnet_info *vi = rq->vq->vdev->priv;
> >> -	struct virtnet_rq_dma *dma;
> >> -	void *buf, *head;
> >> -	dma_addr_t addr;
> >> -
> >> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
> >> -
> >> -	head = page_address(alloc_frag->page);
> >> -
> >> -	dma = head;
> >> -
> >> -	/* new pages */
> >> -	if (!alloc_frag->offset) {
> >> -		if (rq->last_dma) {
> >> -			/* Now, the new page is allocated, the last dma
> >> -			 * will not be used. So the dma can be unmapped
> >> -			 * if the ref is 0.
> >> -			 */
> >> -			virtnet_rq_unmap(rq, rq->last_dma, 0);
> >> -			rq->last_dma = NULL;
> >> -		}
> >> -
> >> -		dma->len = alloc_frag->size - sizeof(*dma);
> >> -
> >> -		addr = virtqueue_map_single_attrs(rq->vq, dma + 1,
> >> -						  dma->len, DMA_FROM_DEVICE, 0);
> >> -		if (virtqueue_map_mapping_error(rq->vq, addr))
> >> -			return NULL;
> >> -
> >> -		dma->addr = addr;
> >> -		dma->need_sync = virtqueue_map_need_sync(rq->vq, addr);
> >> -
> >> -		/* Add a reference to dma to prevent the entire dma from
> >> -		 * being released during error handling. This reference
> >> -		 * will be freed after the pages are no longer used.
> >> -		 */
> >> -		get_page(alloc_frag->page);
> >> -		dma->ref = 1;
> >> -		alloc_frag->offset = sizeof(*dma);
> >> -
> >> -		rq->last_dma = dma;
> >> -	}
> >> -
> >> -	++dma->ref;
> >> -
> >> -	buf = head + alloc_frag->offset;
> >> -
> >> -	get_page(alloc_frag->page);
> >> -	alloc_frag->offset += size;
> >> -
> >> -	return buf;
> >> +	return virtqueue_get_buf_ctx(rq->vq, len, ctx);
> >>  }
> >>
> >>  static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, void *buf)
> >> @@ -1067,9 +949,6 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, void *buf)
> >>  		return;
> >>  	}
> >>
> >> -	if (!vi->big_packets || vi->mergeable_rx_bufs)
> >> -		virtnet_rq_unmap(rq, buf, 0);
> >> -
> >>  	virtnet_rq_free_buf(vi, rq, buf);
> >>  }
> >>
> >> @@ -1335,7 +1214,7 @@ static int xsk_append_merge_buffer(struct virtnet_info *vi,
> >>
> >>  		truesize = len;
> >>
> >> -		curr_skb = virtnet_skb_append_frag(head_skb, curr_skb, page,
> >> +		curr_skb = virtnet_skb_append_frag(rq, head_skb, curr_skb, page,
> >>  						   buf, len, truesize);
> >>  		if (!curr_skb) {
> >>  			put_page(page);
> >> @@ -1771,7 +1650,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
> >>  	return ret;
> >>  }
> >>
> >> -static void put_xdp_frags(struct xdp_buff *xdp)
> >> +static void put_xdp_frags(struct receive_queue *rq, struct xdp_buff *xdp)
> >>  {
> >>  	struct skb_shared_info *shinfo;
> >>  	struct page *xdp_page;
> >> @@ -1781,7 +1660,7 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> >>  		shinfo = xdp_get_shared_info_from_buff(xdp);
> >>  		for (i = 0; i < shinfo->nr_frags; i++) {
> >>  			xdp_page = skb_frag_page(&shinfo->frags[i]);
> >> -			put_page(xdp_page);
> >> +			page_pool_put_page(rq->page_pool, xdp_page, -1, true);
> >>  		}
> >>  	}
> >>  }
> >> @@ -1873,7 +1752,7 @@ static struct page *xdp_linearize_page(struct net_device *dev,
> >>  	if (page_off + *len + tailroom > PAGE_SIZE)
> >>  		return NULL;
> >>
> >> -	page = alloc_page(GFP_ATOMIC);
> >> +	page = page_pool_alloc_pages(rq->page_pool, GFP_ATOMIC);
> >>  	if (!page)
> >>  		return NULL;
> >>
> >> @@ -1896,8 +1775,12 @@ static struct page *xdp_linearize_page(struct net_device *dev,
> >>  		p = virt_to_head_page(buf);
> >>  		off = buf - page_address(p);
> >>
> >> +		if (rq->use_page_pool_dma)
> >> +			page_pool_dma_sync_for_cpu(rq->page_pool, p,
> >> +						   off, buflen);
> >
> > Interesting, I think we need a patch for -stable to sync for cpu as
> > well (and probably the XDP_TX path).
> >
> >
> >> +
> >>  		if (check_mergeable_len(dev, ctx, buflen)) {
> >> -			put_page(p);
> >> +			page_pool_put_page(rq->page_pool, p, -1, true);
> >>  			goto err_buf;
> >>  		}
> >>
> >> @@ -1905,38 +1788,36 @@ static struct page *xdp_linearize_page(struct net_device *dev,
> >>  		 * is sending packet larger than the MTU.
> >>  		 */
> >>  		if ((page_off + buflen + tailroom) > PAGE_SIZE) {
> >> -			put_page(p);
> >> +			page_pool_put_page(rq->page_pool, p, -1, true);
> >>  			goto err_buf;
> >>  		}
> >>
> >>  		memcpy(page_address(page) + page_off,
> >>  		       page_address(p) + off, buflen);
> >>  		page_off += buflen;
> >> -		put_page(p);
> >> +		page_pool_put_page(rq->page_pool, p, -1, true);
> >>  	}
> >>
> >>  	/* Headroom does not contribute to packet length */
> >>  	*len = page_off - XDP_PACKET_HEADROOM;
> >>  	return page;
> >>  err_buf:
> >> -	__free_pages(page, 0);
> >> +	page_pool_put_page(rq->page_pool, page, -1, true);
> >>  	return NULL;
> >>  }
> >>
> >>  static struct sk_buff *receive_small_build_skb(struct virtnet_info *vi,
> >>  					       unsigned int xdp_headroom,
> >>  					       void *buf,
> >> -					       unsigned int len)
> >> +					       unsigned int len,
> >> +					       unsigned int buflen)
> >>  {
> >>  	unsigned int header_offset;
> >>  	unsigned int headroom;
> >> -	unsigned int buflen;
> >>  	struct sk_buff *skb;
> >>
> >>  	header_offset = VIRTNET_RX_PAD + xdp_headroom;
> >>  	headroom = vi->hdr_len + header_offset;
> >> -	buflen = SKB_DATA_ALIGN(GOOD_PACKET_LEN + headroom) +
> >> -		 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >>
> >
> > Any reason for removing this?
> >
> page_pool_alloc_va() can return a larger allocation than requested as it
> appends remaining fragment space to avoid truesize underestimation
> (comment in page_pool_alloc_netmem() in helpers.h). The old hardcoded
> computation would always produce the requested ~512 bytes, ignoring any
> extra space page_pool gave us, so build_skb() would set skb->truesize
> too low. To pass the real size through: add_recvbuf_small() encodes
> alloc_len and xdp_headroom into ctx via mergeable_len_to_ctx(alloc_len,
> xdp_headroom). On the receive side, receive_small() extracts it with
> mergeable_ctx_to_truesize(ctx) and passes it as the buflen parameter to
> receive_small_build_skb().
>
Right. So

Acked-by: Jason Wang <[email protected]>

Thanks

