On Fri, 29 May 2026 15:34:00 +0200 Thomas Monjalon <[email protected]> wrote:
> From: Gregory Etelson <[email protected]> > > Selective Rx may save some PCI bandwidth. > Implement selective Rx in the (quite slow) scalar SPRQ Rx path > mlx5_rx_burst() where the performance impact > of the added condition branches is acceptable. > Other Rx functions do not support this feature. > When using selective Rx, mlx5_rx_burst will be selected. > > A null Memory Region (MR) is always allocated > at shared device context initialization. > The selective Rx capability is not advertised > if this special MR allocation fails. > > For each Rx segment configured with a NULL mempool, > a "null mbuf" is created. > It is a fake mbuf allocated outside any mempool, > used as a placeholder in the Rx ring. > The null MR lkey is used in the WQE for these segments > so the NIC writes received data to a discard buffer. > The mbuf data room size is resolved from the first segment having a pool. > For null segments, the buffer length is from the last seen pool, > so that the WQE stride size remains consistent. > > In mlx5_rx_burst, discarded segments are not chained > into the packet mbuf list, NB_SEGS is decremented accordingly, > and no replacement buffer is allocated. > A separate data_seg_len accumulator tracks the total length > of delivered segments only. > The packet length is adjusted to reflect only the data > actually delivered to the application. > > Signed-off-by: Gregory Etelson <[email protected]> > Signed-off-by: Thomas Monjalon <[email protected]> > --- AI review with Opus 4.8 and High setting found one issue: Patch 6: net/mlx5: support selective Rx Error: NULL pointer dereference when the first configured Rx segment is a discard segment (mp == NULL). In mlx5_rx_burst() the head mbuf and the chain tail are tracked like this: if (pkt) { if (rep->pool) NEXT(tail) = rep; else --NB_SEGS(pkt); } ... if (seg->pool) { tail = seg; ... } tail is only ever assigned inside "if (seg->pool)", and pkt is set to the first processed segment unconditionally (pkt = seg in the !pkt block, no pool guard). So if the first segment of a packet is a discard segment: pkt becomes the null_mbuf (pool == NULL), tail stays NULL; on the next (real) segment, rep->pool is set, so NEXT(tail) = rep executes with tail == NULL -> write through NULL. Even without the crash, returning the pool-less null_mbuf as the packet head is wrong: the application later frees it back to a NULL pool. This is reachable, not theoretical. testpmd (patch 3) inserts a leading mp==NULL segment whenever the first offset is > 0 (seg_offset > next_offset with next_offset starting at 0), ethdev check_split (patch 2) now permits a leading NULL mp, and mlx5_rxq_new() accepts it (first_mp is just the first non-NULL pool; there is no requirement that rxseg[0].mp != NULL). The DTS cases selective_rx_payload_only (rxoffs=[34]) and selective_rx_two_segments (rxoffs=[14,...]) in patch 10 configure exactly this layout, and mlx5_selective_rx_enabled() forces the scalar mlx5_rx_burst path, so the buggy path is the one that runs. Trace for rxoffs=34 / rxpkts=payload (segments: discard[0,34) real[34,290) discard[290,max)): iter0 (discard head): pkt == NULL, seg->pool == NULL -> pkt = null_mbuf, tail not set; len(290) > DATA_LEN(34) -> ++NB_SEGS, continue. iter1 (real seg): pkt set, rep->pool != NULL -> NEXT(tail==NULL)=rep. Suggested fix: a discard segment must not become the packet head/tail. Either reject rxseg[0].mp == NULL in mlx5_rxq_new() (cleanest, matches the "deliver last N bytes" case being unsupported here), or make the data path skip leading discard segments without assigning them to pkt and only set pkt/tail on the first segment with a pool. If leading discard is intended to be supported, the head selection and NEXT(tail) linking both need to account for tail == NULL. The same head/tail assumption also means a packet that falls entirely within a leading discard segment would be returned with a NULL-pool head; fixing the above covers that too.

