On Thu, 12 Mar 2026 14:32:48 +0100
Xavier Guillaume <[email protected]> wrote:
> Hi Stephen,
>
> > I wonder if TPACKET header could go in mbuf headroom.
> > And also, could the copy on receive be avoided?
>
> Thank you for your review and the interesting questions. I had not
> considered these angles, so I took some time to look into it.
>
> As far as I understand, the current RX path copies the packet data
> from the ring frame into an mbuf so that the ring slot can be returned to
> the kernel immediately after the copy. This keeps the ring available
> for new packets regardless of how long the application holds the mbuf.
>
> Going down the zero-copy route would introduce a strong coupling
> between kernel-managed ring frames and DPDK-managed mbufs: the ring
> slot could not be released until the last reference to the mbuf is
> freed, which risks stalling the ring under any buffering.
>
> Because of this copy and the resulting decoupling, the TPACKET header
> does not need to be carried into the mbuf at all. It is only read
> for metadata (packet length, VLAN, timestamp) before the frame is
> released back to the kernel.
>
> In this context, my feeling is that the introduced risks outweigh the
> gains (the memcpy looks relatively small compared to the full kernel
> networking stack af_packet goes through).
>
> Did I miss something?
>
> Regards,
> Xavier
Copies matter, especially for larger packets.
I noticed that later kernels support TPACKET_V3 with sendmsg and
MSG_ZEROCOPY; it was added in the 4.18 kernel, so it should be OK. The
downside is that TX goes from the ring (one syscall per burst) to one
syscall per packet.
For RX, you're right that it adds complexity.
I did some brainstorming (with AI as a sanity check), and it looks like
a mixed mode could work: use zero-copy on Rx until some high watermark
is hit. Something like:
## The design
The receive path becomes:
1. At queue setup, register the entire mmap'd region as an external memory zone
that DPDK knows about (via `rte_extmem_register` if needed for IOVA).
2. On each received frame, allocate an mbuf but attach it to the ring frame via
`rte_pktmbuf_attach_extbuf` instead of copying. The `shinfo` free callback
atomically sets `tp_status = TP_STATUS_KERNEL` to release the frame back to the
kernel.
3. Advance `framenum` as normal — the frame stays owned by userspace until the
mbuf is freed.
## The hard part: ring backpressure
This is the real design question. In the copy path, frames are returned to the
kernel immediately in the RX loop. With zero-copy, a frame is held until the
application frees the mbuf. If the app is slow or holds references (e.g.,
reassembly, batching into a burst for a worker core), you burn through ring
slots fast.
A few options:
- **Large ring** — bump `framecnt` significantly. Memory is cheap and the ring
is already mmap'd. For a capture workload this is usually fine.
- **Fallback to copy** — track how many frames are outstanding. When it crosses
a watermark (say 75% of the ring), fall back to the memcpy path for new packets
so you keep returning frames to the kernel. This is what the AF_XDP PMD does
conceptually with its fill ring management.
- **Just drop** — if the ring is exhausted, that's backpressure. The kernel
drops packets, which shows up in `tp_drops`. For monitoring/capture workloads
this is often acceptable.
The fallback approach is probably the most robust for a general-purpose patch.
Something roughly like:
```c
/* Threshold: if outstanding frames exceed 75% of the ring, fall back to copy. */
bool zero_copy = (outstanding_frames < (framecount * 3 / 4));

if (zero_copy) {
	/* Attach an extbuf pointing into the ring frame; no memcpy. */
	rte_pktmbuf_attach_extbuf(mbuf, pbuf, pbuf_iova, data_len, shinfo);
	rte_pktmbuf_pkt_len(mbuf) = rte_pktmbuf_data_len(mbuf) = ppd->tp_snaplen;
	/* Do NOT set tp_status = TP_STATUS_KERNEL here; the free callback does it. */
	outstanding_frames++;
} else {
	/* Copy path as before. */
	rte_pktmbuf_pkt_len(mbuf) = rte_pktmbuf_data_len(mbuf) = ppd->tp_snaplen;
	memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, ppd->tp_snaplen);
	ppd->tp_status = TP_STATUS_KERNEL;
}
```
The `shinfo` callback would need an atomic decrement on the outstanding counter
plus the `tp_status` write. You'd pre-allocate one `rte_mbuf_ext_shared_info`
per frame slot at init time, each wired to its corresponding `tpacket2_hdr`.
One subtlety: `framenum` advancement is no longer gated on the current frame
being released. You're advancing past frames that are still in-flight. So you
need a separate counter or bitmap to know which frames are actually available
when you wrap around. The simplest approach is to keep checking
`tp_status` as you already do: a frame that was never returned to the
kernel won't have `TP_STATUS_USER` set again, so when the loop comes
back around to a slot that is still in flight, it naturally stops, same
as today. The existing check at the top of the RX loop needs no change.