> From: Bruce Richardson [mailto:[email protected]] > Sent: Tuesday, 26 May 2026 11.40 > > On Tue, May 26, 2026 at 10:41:44AM +0200, Morten Brørup wrote: > > > From: Bruce Richardson [mailto:[email protected]] > > > Sent: Friday, 22 May 2026 18.12 > > > > > > On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote: > > > > This patch refactors the mempool cache to eliminate some > unexpected > > > > behaviour and reduce the mempool cache miss rate. > > > > > > > > > > Agree in principle with most of these changes. As we dicussed at > the > > > DPDK > > > summit conference, only issue I really have is with the threshold > > > limits > > > here - allocating and freeing only half the cache at a time seems > > > overly > > > conservative. Thinking about use-cases: > > > > > > 1 for apps where alloc + free (generally Rx+Tx) is on the same > core(s), > > > then we should run (almost) entirely out of cache. > > > > I strongly disagree about any goal to run the cache low. > > The primary goal is to minimize the cache miss (refill and replenish) > rate. > > > > > 2 for apps where we have alloc and free on different cores, then we > > > have > > > some caches always being filled and others always being emptied > > > > Agree. > > > > > > > > For case #1, we only need worry about the thresholds for the odd > case > > > where we have a burst that causes us to overflow our cache (and we > > > can't increase our cache size to cope and avoid that). > > > Otherwise the thresholds don't matter. > > > > It seems like you assume the application only does something like > this: > > Rx -> Rewrite -> Tx > > > > In that case, the per-lcore cache only needs capacity for one burst, > yes. > > With my patch, the cache can be rightsized by requesting a cache size > of 2 * burst size. > > (Then the fill level will be either size/2 or empty, i.e. one burst > or zero. > > This also happens to meet your suggested goal about low fill level, > which I disagree with.) > > > > However, I don't think that is a realistic use case. > > Many apps do something like this: > > > > |-> Rewrite ->| > > Rx ->| |-> Tx > > |-> Hold | > > Release ->| > > > > I agree, that's why our cache size needs to be bigger than our burst > size. > > > They often hold back packets before they are transmitted. > > For a simple router, when the destination IP address is not in the > neighbor table, packets to that IP address are queued until ARP/ND has > been resolved, and then they are dequeued and transmitted. > > Or apps performing shaping or pacing, where packets are held back in > queues, and dequeued at the appropriate time. > > For such apps, the waves are much bigger (than the simple Rx- > >Rewrite->Tx use case). > > > > With a random enqueue/dequeue pattern, replenishing/draining the > cache to size/2 minimizes the probability of reaching one of its edges > (empty or full), triggering a "cache miss" (refill/replenish). > > > > > However, for case #2, the thresholds are constantly involved as > > > we > > > always are going to backing store. In this case, we really want to > have > > > the > > > allocs *always* fill the cache completely, and the frees completely > > > empty > > > the cache. > > > > Agree. > > > > > > > > Because of this, while we want to avoid cases where we fill the > cache > > > completely only to have a further free causing it to be flushed, > > > because of > > > case #2 we cannot be overly conservative in how much we free/empty. > > > Ideally, we want to fill to full less a single burst, and empty > leaving > > > only a single burst in the cache. Unfortunately, we don't know what > > > those > > > burst limits are, so we have to try and guess the best behaviour > from > > > everything else. > > > > I agree about not wanting to be overly conservative. > > But in the use cases I have described for #1, I don't think a target > fill level of size/2 is overly conservative. > > > > I also acknowledge that this patch doubles the mempool cache miss > rate for #2. > > E.g. with a cache size of 512 and burst size of 64, the per-burst > miss rate will be 64 / (1/2 * 512) = 1/4, compared to 64 / 512 = 1/8 > with a full replenish/drain algorithm. > > > > In theory, we could make it build time configurable to optimize > mempools for #2. > > But mempools are also used for other objects than mbufs, so that > would have unwanted side effects for non-mbuf mempools. > > > > If we went for an algorithm targeting replenish/drain at 25 % from > the edges, the per-burst miss rate for #2 would be: 64 / (3/4 * 512) = > 1/6. > > > > How about addressing #2 in the release notes: > > We describe that the cache refill/drain algorithm has been changed to > only refill/drain to 50 % of the cache size, so pipelined applications > performing Rx (mempool get) and Tx (mempool put) on separate cores > should configure their mbuf pools with double the cache size of what > they previously were to achieve similar performance. > > > > > > > > All that said, commits with specific suggestions inline. > > > > > > /Bruce > > > > > > <snip> > > > > > > > diff --git a/lib/mempool/rte_mempool.h > b/lib/mempool/rte_mempool.h > > > > index 2e54fc4466..432c43ab15 100644 > > > > --- a/lib/mempool/rte_mempool.h > > > > +++ b/lib/mempool/rte_mempool.h > > > > @@ -89,7 +89,7 @@ struct __rte_cache_aligned > rte_mempool_debug_stats > > > { > > > > */ > > > > struct __rte_cache_aligned rte_mempool_cache { > > > > uint32_t size; /**< Size of the cache */ > > > > - uint32_t flushthresh; /**< Threshold before we flush excess > > > elements */ > > > > + uint32_t flushthresh; /**< Obsolete; for API/ABI > compatibility > > > purposes only */ > > > > uint32_t len; /**< Current cache count */ > > > > #ifdef RTE_LIBRTE_MEMPOOL_STATS > > > > uint32_t unused; > > > > @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache > { > > > > /** > > > > * Cache objects > > > > * > > > > - * Cache is allocated to this size to allow it to overflow > in > > > certain > > > > - * cases to avoid needless emptying of cache. > > > > + * Note: > > > > + * Cache is allocated at double size for API/ABI > compatibility > > > purposes only. > > > > + * When reducing its size at an API/ABI breaking release, > > > > + * remember to add a cache guard after it. > > > > */ > > > > alignas(RTE_CACHE_LINE_SIZE) void > > > *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; > > > > }; > > > > @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp); > > > > * @param cache_size > > > > * If cache_size is non-zero, the rte_mempool library will try > to > > > > * limit the accesses to the common lockless pool, by > maintaining > > > a > > > > - * per-lcore object cache. This argument must be lower or > equal to > > > > - * RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5. > > > > + * per-lcore object cache. This argument must be an even > number, > > > > + * lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n. > > > > * The access to the per-lcore table is of course > > > > * faster than the multi-producer/consumer pool. The cache can > be > > > > * disabled if the cache_size argument is set to 0; it can be > > > useful to > > > > * avoid losing objects in cache. > > > > + * Note: > > > > + * Mempool put/get requests of more than cache_size / 2 > objects > > > may be > > > > + * partially or fully served directly by the multi- > > > producer/consumer > > > > + * pool, to avoid the overhead of copying the objects twice > > > (instead of > > > > + * once) when using the cache as a bounce buffer. > > > > * @param private_data_size > > > > * The size of the private data appended after the mempool > > > > * structure. This is useful for storing some private data > after > > > the > > > > @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct > rte_mempool > > > *mp, void * const *obj_table, > > > > RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1); > > > > RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n); > > > > > > > > - __rte_assume(cache->flushthresh <= > RTE_MEMPOOL_CACHE_MAX_SIZE * > > > 2); > > > > - __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2); > > > > - __rte_assume(cache->len <= cache->flushthresh); > > > > - if (likely(cache->len + n <= cache->flushthresh)) { > > > > + __rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE); > > > > + __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE > / 2); > > > > + __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE); > > > > + __rte_assume(cache->len <= cache->size); > > > > + if (likely(cache->len + n <= cache->size)) { > > > > /* Sufficient room in the cache for the objects. */ > > > > cache_objs = &cache->objs[cache->len]; > > > > cache->len += n; > > > > - } else if (n <= cache->flushthresh) { > > > > + } else if (n <= cache->size / 2) { > > > > /* > > > > - * The cache is big enough for the objects, but - as > > > detected by > > > > - * the comparison above - has insufficient room for > them. > > > > - * Flush the cache to make room for the objects. > > > > + * The number of objects is within the cache bounce > buffer > > > limit, > > > > + * but - as detected by the comparison above - the > cache > > > has > > > > + * insufficient room for them. > > > > + * Flush the cache to the backend to make room for > the > > > objects; > > > > + * flush (size / 2) objects from the bottom of the > cache, > > > where > > > > + * objects are less hot, and move down the remaining > > > objects, which > > > > + * are more hot, from the upper half of the cache. > > > > */ > > > > - cache_objs = &cache->objs[0]; > > > > - rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache- > >len); > > > > - cache->len = n; > > > > + __rte_assume(cache->len > cache->size / 2); > > > > + rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], > cache- > > > >size / 2); > > > > + rte_memcpy(&cache->objs[0], &cache->objs[cache->size > / 2], > > > > + sizeof(void *) * (cache->len - cache- > >size / > > > 2)); > > > > + cache_objs = &cache->objs[cache->len - cache->size / > 2]; > > > > + cache->len = cache->len - cache->size / 2 + n; > > > > > > The flushing of only half the cache I'm not so certain about. I > agree > > > that > > > we want to not flush to empty, but I also think that we want to do > more > > > than a half-flush, especially since we do an enqueue to the cache > > > immediately afterwards. Consider the case where we have a cache > size of > > > 128, and we do an enqueue of 32, with the cache currently full. In > that > > > case we only flush 64, reducing the cache to 64, but then > immediately > > > bringing it back up to 96. > > > > I thought in depth about whether the flush/replenish sizes should > consider the request size or not. (E.g. if I should replenish size/2 or > size/2+request.) > > I decided for not considering the request size, for two reasons: > > a) It roughly doesn't matter, especially when considering a sequence > of random get/put requests. > > b) The size of the backend transactions become fixed, which has > performance benefits: With my patch, they are always size/2, so if the > cache size is 2^N, the backend transactions are 2^N and CPU cache > aligned. > > > > > For cases where we have pipelines with all > > > alloc > > > on one core and all free on another, this half-flush would be > > > inefficient. > > > > > > Instead, I would look to have a lower target threshold post-flush, > and > > > I > > > would suggest 25% - taking into account the newly freed buffers. > > > > It's not good for #1. > > I agree that it is better for #2. But I don't think #2 is the likely > use case. > > > > After our discussion at the summit, I did start working a patch > targeting fill levels at 25% from the cache edges, but I don't think > it's better; so I'd rather stick with a target fill level of 50%. > > > > > For example: > > > > > > /* if n > our target of 1/4 full, flush everything, > > > * else flush so that we end up with 1/4 full after n added. > > > */ > > > flush_count = n > cache->size/4 ? cache->len : > > > (cache->len + n) - cache->size/4; > > > > > > > > > > } else { > > > > - /* The request itself is too big for the cache. */ > > > > + /* The request itself is too big. */ > > > > goto driver_enqueue_stats_incremented; > > > > > > I think original comment is better. The request itself is not too > big > > > for > > > the whole mempool, just for the cache. > > > > Ack. > > > > > > > > > } > > > > > > > > @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct > rte_mempool > > > *mp, void **obj_table, > > > > /* The cache is a stack, so copy will be in reverse order. > */ > > > > cache_objs = &cache->objs[cache->len]; > > > > > > > > - __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2); > > > > + __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE); > > > > if (likely(n <= cache->len)) { > > > > /* The entire request can be satisfied from the > cache. */ > > > > RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, > 1); > > > > @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct > rte_mempool > > > *mp, void **obj_table, > > > > for (index = 0; index < len; index++) > > > > *obj_table++ = *--cache_objs; > > > > > > > > - /* Dequeue below would overflow mem allocated for cache? */ > > > > - if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE)) > > > > + /* Dequeue below would exceed the cache bounce buffer > limit? */ > > > > + __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE > / 2); > > > > + if (unlikely(remaining > cache->size / 2)) > > > > goto driver_dequeue; > > > > > > > > - /* Fill the cache from the backend; fetch size + remaining > > > objects. */ > > > > - ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, > > > > - cache->size + remaining); > > > > + /* Fill the cache from the backend; fetch (size / 2) > objects. */ > > > > + ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache- > >size / > > > 2); > > > > > > Again, the cache->size / 2 doesn't seem right here. We at most > half- > > > fill > > > the cache and then take some objects from that, meaning that have > just > > > done > > > a re-fill of cache but end the function with it less than half > full. > > > Since > > > we take from this value, I'd suggest just filling the cache > completely. > > > > The issues at the edges of the cache are symmetrical. > > If we replenish the cache to full, and the next transaction is a put, > the cache needs to be drained. > > That's why I replenish to size/2. > > > > Not necessarily. After you fill the cache here (to 50%) you then take > elements from it. If we filled to 100%, then we'd only hit a flush if > we > freed back more elements than we just allocated. Now, that's reasonably > likely which is why I'm ok to not filling to 100%. However, doing a > refill > as part of the op, and then leaving the cache less than half full after > the > op finishes is just wasteful IMHO!
I'm targeting half full, and disregard if it is pre-op or post-op. This makes the backend transactions cache aligned in both addressing and size, which opens an opportunity to performance optimize the mempool drivers for this. > > One other point. The obvious worst case scenario we want to avoid is > constant fill/empty/fill/empty sequences. However, so long as we have > an > asymmetry in how we do fills and empty thresholds, a repeating sequence > should never occur, and even pairs of fill/empty should be very rare. > Consider the case, for example, where we fill to 100% but only drain to > 25%. For this, we can ignore scenario #2, since we should have pretty > good > cache usage. For scenario #1, of allocs/frees on a single core, > randomly > interleaved, our worst case is: > 1) alloc with cache empty, in which case we fill to 100%, then take the > alloc > 2) have a free of a burst greater than that which we just allocated, > causing an immediate flush again. > > That's pretty poor behaviour, but the thing is that after the flush we > now > have mempool at 25% + freed burst - so expected between 25-50% of cache > utilization. That's the sweet spot we want to target - after each > operation, the mempool cache should ideally be at 50%. Whether the next > op > is an alloc or a free, our cache can handle it, and likely a couple of > each > in sequence. Therefore, our possible fill/empty combinations are likely > to > be rare occurances. > > [In all this, I am making the assumption that burst size is well less > than > cache size. Also, similar logic would be applicable for the inverse > scenario, e.g. flush to empty (and fill burst) and fill to 75%] I'm not so sure about this assumption. With a cache size of 512 and a bursts of 64, the cache only holds 8 bursts. 50% is 4 bursts, and 25% is only 2 bursts. Using a replenish/drain level in the middle requires 5 bursts in either direction to pass the edge (and trigger replenish/flush). Using a replenish/drain level 25% from the edge requires only 3 bursts in the wrong direction to pass the edge (and trigger replenish/flush). Much higher probability with random get/put. > > Now, all said, I tend to agree that we want to leave space for a decent > size burst after a fill. That is why I think that filling to 75% is > reasonable. After an alloc that triggers a fill, I don't want the cache > less than 50% full, but not completely full so there is room for a free > without a flush, and similarly for a free that triggers a flush, the > cache > should not be empty, but also should not be more than half full. > > One suggestion - we could always add a simple tunable that specifies > the > margin, or reserved entries for alloc and free. We can then guide in > the > docs that the value should be e.g. "zero for apps where alloc and free > take > place on different cores. 20%-50% of cache is recommended where alloc > and > free take place on the same core" Yes, a simple tunable is a really good idea. At this point, I think we should optimize for use case #1, and go for the 50% fill level. Then we can add a tunable to optimize for use case #2 later. I will try to come up with a draft for such a follow-up patch within the next few days. The 50% fill level in this patch is not as bad for use case #2 (roughly doubling the burst miss rate from 1/8 to 1/4), compared to how bad the original algorithm is for use case #1 (very high miss probability - only two ops in the wrong direction - after drain/replenish). -Morten

