On Thu, 12 Mar 2026 14:32:48 +0100
Xavier Guillaume <[email protected]> wrote:
> Hi Stephen,
>
> > I wonder if TPACKET header could go in mbuf headroom.
> > And also, could the copy on receive be avoided?
>
> Thank you for your review and the interesting questions. I had not
> considered these angles, so I took some time to look into it.
>
> As far as I understand, the current RX path copies the packet data
> from the ring frame into an mbuf so that the ring slot can be returned to
> the kernel immediately after the copy. This keeps the ring available
> for new packets regardless of how long the application holds the mbuf.
>
> Going down the zero-copy route would introduce a strong coupling
> between kernel-managed ring frames and DPDK-managed mbufs: the ring
> slot could not be released until the last reference to the mbuf is
> freed, which risks stalling the ring under any buffering.
>
> Because of this copy and the resulting decoupling, the TPACKET header
> does not need to be carried into the mbuf at all. It is only read
> for metadata (packet length, VLAN, timestamp) before the frame is
> released back to the kernel.
>
> In this context, my feeling is that the introduced risks outweigh the
> gains (the memcpy looks relatively small compared to the full kernel
> networking stack af_packet goes through).
>
> Did I miss something?
>
> Regards,
> Xavier
Copies matter, especially for larger packets.
I noticed that later kernels support TPACKET_V3 with sendmsg and
MSG_ZEROCOPY; it was added in the 4.18 kernel, so it should be OK. The
downside is that TX goes from the ring (one syscall per burst) to one
syscall per packet.
For RX, you're right that it adds complexity.
I did some brainstorming (with AI as a sanity check), and it looks like
a mixed mode could work: use zero-copy on Rx until some high watermark
is hit. Something like:
## The design
The receive path becomes:
1. At queue setup, register the entire mmap'd region as an external memory zone
that DPDK knows about (via `rte_extmem_register` if needed for IOVA).
2. On each received frame, allocate an mbuf but attach it to the ring frame via
`rte_pktmbuf_attach_extbuf` instead of copying. The `shinfo` free callback
atomically sets `tp_status = TP_STATUS_KERNEL` to release the frame back to the
kernel.
3. Advance `framenum` as normal — the frame stays owned by userspace until the
mbuf is freed.
## The hard part: ring backpressure
This is the real design question. In the copy path, frames are returned to the
kernel immediately in the RX loop. With zero-copy, a frame is held until the
application frees the mbuf. If the app is slow or holds references (e.g.,
reassembly, batching into a burst for a worker core), you burn through ring
slots fast.
A few options:
- **Large ring** — bump `framecnt` significantly. Memory is cheap and the ring
is already mmap'd. For a capture workload this is usually fine.
- **Fallback to copy** — track how many frames are outstanding. When it crosses
a watermark (say 75% of the ring), fall back to the memcpy path for new packets
so you keep returning frames to the kernel. This is what the AF_XDP PMD does
conceptually with its fill ring management.
- **Just drop** — if the ring is exhausted, that's backpressure. The kernel
drops packets, which shows up in `tp_drops`. For monitoring/capture workloads
this is often acceptable.
The fallback approach is probably the most robust for a general-purpose patch.
Something roughly like:
```c
/* Threshold: if outstanding frames exceed 75% of the ring, fall back to copy. */
bool zero_copy = (outstanding_frames < (framecount * 3 / 4));

if (zero_copy) {
	/* Attach an extbuf pointing into the ring frame; no memcpy. */
	rte_pktmbuf_attach_extbuf(mbuf, pbuf, pbuf_iova, data_len, shinfo);
	rte_pktmbuf_pkt_len(mbuf) = rte_pktmbuf_data_len(mbuf) = ppd->tp_snaplen;
	/* Do NOT set tp_status = TP_STATUS_KERNEL here; the free callback does it. */
	outstanding_frames++;
} else {
	/* Copy path as before. */
	rte_pktmbuf_pkt_len(mbuf) = rte_pktmbuf_data_len(mbuf) = ppd->tp_snaplen;
	memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, ppd->tp_snaplen);
	ppd->tp_status = TP_STATUS_KERNEL;
}
```
The `shinfo` callback would need an atomic decrement on the outstanding counter
plus the `tp_status` write. You'd pre-allocate one `rte_mbuf_ext_shared_info`
per frame slot at init time, each wired to its corresponding `tpacket2_hdr`.
One subtlety: `framenum` advancement is no longer gated on the current frame
being released. You're advancing past frames that are still in-flight. So you
need a separate counter or bitmap to know which frames are actually available
when you wrap around. The simplest approach is to keep checking
`tp_status` as you already do: a frame that was never returned to the
kernel won't have `TP_STATUS_USER` set again, so when the loop comes
back around to a slot that is still in flight, it naturally stops, same
as today. The existing check at the top of the RX loop needs no change.