On Thu, May 29, 2025 at 05:20:25AM +0000, Lombardo, Ed wrote:
> Hi,
>
> I have an issue with DPDK 24.11.1 and a 2 port 100G Intel NIC
> (E810-C) on a 22 core CPU dual socket server.
>
> There is a dedicated CPU core to get the packets from DPDK using
> rte_eth_rx_burst() and enqueue the mbufs into a worker ring Q.
> This thread does nothing else. The NIC is dropping packets at 8.5
> Gbps per port.
>
> Studying the perf report, I was interested in
> common_ring_mc_dequeue(). Perf tool shows common_ring_mc_dequeue()
> at 92.86% Self and 92.86% Children.
>
> I see further with perf tool rte_ring_enqueue_bulk() and
> rte_ring_enqueue_bulk_elem(). These are at 0.00% Self and 0.05%
> Children.
>
> Perf tool shows rte_ring_sp_enqueue_bulk_elem (inlined), which is
> what I wanted to see (single producer), representing the enqueue of
> the mbuf pointers to the worker ring Q.
>
> Is it possible to change the common_ring_mc_dequeue() to
> common_ring_sc_dequeue()? Can it be set to one consumer on single
> queue 0?
>
> I believe this is limiting DPDK from reaching 90 Gbps or higher in
> my setup, which is my goal.
>
> I made sure the E810-C firmware was up to date, NIC FW Version:
> 4.80 0x80020543 1.3805.0
>
> Perf report shows:
>
>   - 99.65% input_thread
>     - 99.35% rte_eth_rx_burst (inlined)
>       - ice_recv_scattered_pkts
>           92.83% common_ring_mc_dequeue
>
> Any thoughts or suggestions?

Since this is presumably from the thread doing the enqueuing to the
ring, the common_ring_mc_dequeue comes from the memory pool
implementation rather than your application ring. A certain amount of
ring dequeue time would be expected, since the buffers are allocated on
one core and (presumably) freed or transmitted on a different one.
However, the amount of time spent in that dequeue seems excessive.
Some suggestions:

* Check what mempool cache size your application is using, and increase
  it. The underlying ring implementation is only used once there are no
  buffers in the per-core mempool cache, so a larger cache should lead
  to fewer (but larger) dequeues from the mempool. A cache size of 512
  is what I would suggest trying - there is a rough sketch of the pool
  setup at the end of this mail.

* Since you are moving buffers away from the allocation core, I'd also
  suggest switching the underlying mempool implementation from the
  default ring-based one to the stack mempool, to avoid cycling through
  the whole mempool memory space. Although that mempool implementation
  uses locks, it gives better buffer recycling across cores. This is
  also covered in the sketch below.

* It is possible to switch the mempool's underlying ring to an
  "sc_dequeue" function, but I would view that as risky, since it would
  mean that no core other than the Rx core could ever allocate a buffer
  from the pool (unless you start adding locks, at which point you
  might as well keep the "mc_dequeue" implementation). I'd therefore
  take the two points above as better alternatives.

* Final thing to check - make sure you are not running out of buffers
  in the pool. If the pool runs dry, each allocation has the refill
  path going back to the mempool ring for more buffers, rather than
  being served from the per-core cache. Try over-provisioning your
  memory pool a bit and see if it helps; the small check at the very
  end of this mail shows one way to watch for this.

Regards,
/Bruce
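
P.S. For illustration, here is roughly what I mean for the first two
points. This is a sketch only - the pool name "mbuf_pool" and the count
of 262144 mbufs are placeholders, so size them to your own setup:

    #include <rte_mbuf.h>
    #include <rte_lcore.h>

    /* Sketch only: "mbuf_pool" and 262144 are placeholder values.
     * 512 is the largest per-lcore cache size allowed
     * (RTE_MEMPOOL_CACHE_MAX_SIZE). The final "stack" argument selects
     * the stack mempool driver; passing NULL instead keeps the default
     * ring-based one. */
    static struct rte_mempool *
    create_pool(void)
    {
            return rte_pktmbuf_pool_create_by_ops("mbuf_pool", 262144,
                            512, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
                            rte_socket_id(), "stack");
    }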
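
And for the last point, one simple way to see at runtime whether the
pool is getting close to empty (the helper name here is just for
illustration):

    #include <stdio.h>
    #include <rte_mempool.h>

    /* Illustrative helper: print how many mbufs are still available in
     * the pool versus its total size. If the available count gets close
     * to zero under load, the pool is under-provisioned. */
    static void
    check_pool(const struct rte_mempool *mp)
    {
            printf("%s: %u of %u mbufs available\n", mp->name,
                   rte_mempool_avail_count(mp), mp->size);
    }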