Re: [PATCH v5] mempool: improve cache behaviour and performance

Bruce Richardson Tue, 26 May 2026 02:40:14 -0700

On Tue, May 26, 2026 at 10:41:44AM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:[email protected]]
> > Sent: Friday, 22 May 2026 18.12
> > 
> > On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote:
> > > This patch refactors the mempool cache to eliminate some unexpected
> > > behaviour and reduce the mempool cache miss rate.
> > >
> > 
> > Agree in principle with most of these changes. As we dicussed at the
> > DPDK
> > summit conference, only issue I really have is with the threshold
> > limits
> > here - allocating and freeing only half the cache at a time seems
> > overly
> > conservative. Thinking about use-cases:
> > 
> > 1 for apps where alloc + free (generally Rx+Tx) is on the same core(s),
> >   then we should run (almost) entirely out of cache.
> 
> I strongly disagree about any goal to run the cache low.
> The primary goal is to minimize the cache miss (refill and replenish) rate.
> 
> > 2 for apps where we have alloc and free on different cores, then we
> > have
> >   some caches always being filled and others always being emptied
> 
> Agree.
> 
> > 
> > For case #1, we only need worry about the thresholds for the odd case
> > where we have a burst that causes us to overflow our cache (and we
> > can't increase our cache size to cope and avoid that). 
> > Otherwise the thresholds don't matter.
> 
> It seems like you assume the application only does something like this:
> Rx -> Rewrite -> Tx
> 
> In that case, the per-lcore cache only needs capacity for one burst, yes.
> With my patch, the cache can be rightsized by requesting a cache size of 2 * 
> burst size.
> (Then the fill level will be either size/2 or empty, i.e. one burst or zero.
> This also happens to meet your suggested goal about low fill level, which I 
> disagree with.)
> 
> However, I don't think that is a realistic use case.
> Many apps do something like this:
> 
>      |-> Rewrite ->|
> Rx ->|             |-> Tx
>      |-> Hold      |
>          Release ->|
>


I agree, that's why our cache size needs to be bigger than our burst size.

> They often hold back packets before they are transmitted.
> For a simple router, when the destination IP address is not in the neighbor 
> table, packets to that IP address are queued until ARP/ND has been resolved, 
> and then they are dequeued and transmitted.
> Or apps performing shaping or pacing, where packets are held back in queues, 
> and dequeued at the appropriate time.
> For such apps, the waves are much bigger (than the simple Rx->Rewrite->Tx use 
> case).
> 
> With a random enqueue/dequeue pattern, replenishing/draining the cache to 
> size/2 minimizes the probability of reaching one of its edges (empty or 
> full), triggering a "cache miss" (refill/replenish).
> 
> > However, for case #2, the thresholds are constantly involved as
> > we
> > always are going to backing store. In this case, we really want to have
> > the
> > allocs *always* fill the cache completely, and the frees completely
> > empty
> > the cache.
> 
> Agree.
> 
> > 
> > Because of this, while we want to avoid cases where we fill the cache
> > completely only to have a further free causing it to be flushed,
> > because of
> > case #2 we cannot be overly conservative in how much we free/empty.
> > Ideally, we want to fill to full less a single burst, and empty leaving
> > only a single burst in the cache. Unfortunately, we don't know what
> > those
> > burst limits are, so we have to try and guess the best behaviour from
> > everything else.
> 
> I agree about not wanting to be overly conservative.
> But in the use cases I have described for #1, I don't think a target fill 
> level of size/2 is overly conservative.
> 
> I also acknowledge that this patch doubles the mempool cache miss rate for #2.
> E.g. with a cache size of 512 and burst size of 64, the per-burst miss rate 
> will be 64 / (1/2 * 512) = 1/4, compared to 64 / 512 = 1/8 with a full 
> replenish/drain algorithm.
> 
> In theory, we could make it build time configurable to optimize mempools for 
> #2.
> But mempools are also used for other objects than mbufs, so that would have 
> unwanted side effects for non-mbuf mempools.
> 
> If we went for an algorithm targeting replenish/drain at 25 % from the edges, 
> the per-burst miss rate for #2 would be: 64 / (3/4 * 512) = 1/6.
> 
> How about addressing #2 in the release notes:
> We describe that the cache refill/drain algorithm has been changed to only 
> refill/drain to 50 % of the cache size, so pipelined applications performing 
> Rx (mempool get) and Tx (mempool put) on separate cores should configure 
> their mbuf pools with double the cache size of what they previously were to 
> achieve similar performance.
> 
> > 
> > All that said, commits with specific suggestions inline.
> > 
> > /Bruce
> > 
> > <snip>
> > 
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 2e54fc4466..432c43ab15 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats
> > {
> > >   */
> > >  struct __rte_cache_aligned rte_mempool_cache {
> > >   uint32_t size;        /**< Size of the cache */
> > > - uint32_t flushthresh; /**< Threshold before we flush excess
> > elements */
> > > + uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility
> > purposes only */
> > >   uint32_t len;         /**< Current cache count */
> > >  #ifdef RTE_LIBRTE_MEMPOOL_STATS
> > >   uint32_t unused;
> > > @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
> > >   /**
> > >    * Cache objects
> > >    *
> > > -  * Cache is allocated to this size to allow it to overflow in
> > certain
> > > -  * cases to avoid needless emptying of cache.
> > > +  * Note:
> > > +  * Cache is allocated at double size for API/ABI compatibility
> > purposes only.
> > > +  * When reducing its size at an API/ABI breaking release,
> > > +  * remember to add a cache guard after it.
> > >    */
> > >   alignas(RTE_CACHE_LINE_SIZE) void
> > *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
> > >  };
> > > @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
> > >   * @param cache_size
> > >   *   If cache_size is non-zero, the rte_mempool library will try to
> > >   *   limit the accesses to the common lockless pool, by maintaining
> > a
> > > - *   per-lcore object cache. This argument must be lower or equal to
> > > - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> > > + *   per-lcore object cache. This argument must be an even number,
> > > + *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
> > >   *   The access to the per-lcore table is of course
> > >   *   faster than the multi-producer/consumer pool. The cache can be
> > >   *   disabled if the cache_size argument is set to 0; it can be
> > useful to
> > >   *   avoid losing objects in cache.
> > > + *   Note:
> > > + *   Mempool put/get requests of more than cache_size / 2 objects
> > may be
> > > + *   partially or fully served directly by the multi-
> > producer/consumer
> > > + *   pool, to avoid the overhead of copying the objects twice
> > (instead of
> > > + *   once) when using the cache as a bounce buffer.
> > >   * @param private_data_size
> > >   *   The size of the private data appended after the mempool
> > >   *   structure. This is useful for storing some private data after
> > the
> > > @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct rte_mempool
> > *mp, void * const *obj_table,
> > >   RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > >   RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> > >
> > > - __rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE *
> > 2);
> > > - __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > > - __rte_assume(cache->len <= cache->flushthresh);
> > > - if (likely(cache->len + n <= cache->flushthresh)) {
> > > + __rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > > + __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > > + __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > > + __rte_assume(cache->len <= cache->size);
> > > + if (likely(cache->len + n <= cache->size)) {
> > >           /* Sufficient room in the cache for the objects. */
> > >           cache_objs = &cache->objs[cache->len];
> > >           cache->len += n;
> > > - } else if (n <= cache->flushthresh) {
> > > + } else if (n <= cache->size / 2) {
> > >           /*
> > > -          * The cache is big enough for the objects, but - as
> > detected by
> > > -          * the comparison above - has insufficient room for them.
> > > -          * Flush the cache to make room for the objects.
> > > +          * The number of objects is within the cache bounce buffer
> > limit,
> > > +          * but - as detected by the comparison above - the cache
> > has
> > > +          * insufficient room for them.
> > > +          * Flush the cache to the backend to make room for the
> > objects;
> > > +          * flush (size / 2) objects from the bottom of the cache,
> > where
> > > +          * objects are less hot, and move down the remaining
> > objects, which
> > > +          * are more hot, from the upper half of the cache.
> > >            */
> > > -         cache_objs = &cache->objs[0];
> > > -         rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > > -         cache->len = n;
> > > +         __rte_assume(cache->len > cache->size / 2);
> > > +         rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache-
> > >size / 2);
> > > +         rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
> > > +                         sizeof(void *) * (cache->len - cache->size /
> > 2));
> > > +         cache_objs = &cache->objs[cache->len - cache->size / 2];
> > > +         cache->len = cache->len - cache->size / 2 + n;
> > 
> > The flushing of only half the cache I'm not so certain about. I agree
> > that
> > we want to not flush to empty, but I also think that we want to do more
> > than a half-flush, especially since we do an enqueue to the cache
> > immediately afterwards. Consider the case where we have a cache size of
> > 128, and we do an enqueue of 32, with the cache currently full. In that
> > case we only flush 64, reducing the cache to 64, but then immediately
> > bringing it back up to 96.
> 
> I thought in depth about whether the flush/replenish sizes should consider 
> the request size or not. (E.g. if I should replenish size/2 or 
> size/2+request.)
> I decided for not considering the request size, for two reasons:
> a) It roughly doesn't matter, especially when considering a sequence of 
> random get/put requests.
> b) The size of the backend transactions become fixed, which has performance 
> benefits: With my patch, they are always size/2, so if the cache size is 2^N, 
> the backend transactions are 2^N and CPU cache aligned.
> 
> > For cases where we have pipelines with all
> > alloc
> > on one core and all free on another, this half-flush would be
> > inefficient.
> > 
> > Instead, I would look to have a lower target threshold post-flush, and
> > I
> > would suggest 25% - taking into account the newly freed buffers.
> 
> It's not good for #1.
> I agree that it is better for #2. But I don't think #2 is the likely use case.
> 
> After our discussion at the summit, I did start working a patch targeting 
> fill levels at 25% from the cache edges, but I don't think it's better; so 
> I'd rather stick with a target fill level of 50%.
> 
> > For example:
> > 
> >     /* if n > our target of 1/4 full, flush everything,
> >      * else flush so that we end up with 1/4 full after n added.
> >      */
> >     flush_count = n > cache->size/4 ? cache->len :
> >                     (cache->len + n) - cache->size/4;
> > 
> > 
> > >   } else {
> > > -         /* The request itself is too big for the cache. */
> > > +         /* The request itself is too big. */
> > >           goto driver_enqueue_stats_incremented;
> > 
> > I think original comment is better. The request itself is not too big
> > for
> > the whole mempool, just for the cache.
> 
> Ack.
> 
> > 
> > >   }
> > >
> > > @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> > *mp, void **obj_table,
> > >   /* The cache is a stack, so copy will be in reverse order. */
> > >   cache_objs = &cache->objs[cache->len];
> > >
> > > - __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > > + __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > >   if (likely(n <= cache->len)) {
> > >           /* The entire request can be satisfied from the cache. */
> > >           RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > > @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool
> > *mp, void **obj_table,
> > >   for (index = 0; index < len; index++)
> > >           *obj_table++ = *--cache_objs;
> > >
> > > - /* Dequeue below would overflow mem allocated for cache? */
> > > - if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> > > + /* Dequeue below would exceed the cache bounce buffer limit? */
> > > + __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > > + if (unlikely(remaining > cache->size / 2))
> > >           goto driver_dequeue;
> > >
> > > - /* Fill the cache from the backend; fetch size + remaining
> > objects. */
> > > - ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > > -                 cache->size + remaining);
> > > + /* Fill the cache from the backend; fetch (size / 2) objects. */
> > > + ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size /
> > 2);
> > 
> > Again, the cache->size / 2 doesn't seem right here. We at most half-
> > fill
> > the cache and then take some objects from that, meaning that have just
> > done
> > a re-fill of cache but end the function with it less than half full.
> > Since
> > we take from this value, I'd suggest just filling the cache completely.
> 
> The issues at the edges of the cache are symmetrical.
> If we replenish the cache to full, and the next transaction is a put, the 
> cache needs to be drained.
> That's why I replenish to size/2.
> 

Not necessarily. After you fill the cache here (to 50%) you then take
elements from it. If we filled to 100%, then we'd only hit a flush if we
freed back more elements than we just allocated. Now, that's reasonably
likely which is why I'm ok to not filling to 100%. However, doing a refill
as part of the op, and then leaving the cache less than half full after the
op finishes is just wasteful IMHO!

One other point. The obvious worst case scenario we want to avoid is
constant fill/empty/fill/empty sequences. However, so long as we have an
asymmetry in how we do fills and empty thresholds, a repeating sequence
should never occur, and even pairs of fill/empty should be very rare.
Consider the case, for example, where we fill to 100% but only drain to
25%. For this, we can ignore scenario #2, since we should have pretty good
cache usage. For scenario #1, of allocs/frees on a single core, randomly
interleaved, our worst case is:
1) alloc with cache empty, in which case we fill to 100%, then take the
alloc
2) have a free of a burst greater than that which we just allocated,
causing an immediate flush again.

That's pretty poor behaviour, but the thing is that after the flush we now
have mempool at 25% + freed burst - so expected between 25-50% of cache
utilization. That's the sweet spot we want to target - after each
operation, the mempool cache should ideally be at 50%. Whether the next op
is an alloc or a free, our cache can handle it, and likely a couple of each
in sequence. Therefore, our possible fill/empty combinations are likely to
be rare occurances.

[In all this, I am making the assumption that burst size is well less than
cache size. Also, similar logic would be applicable for the inverse
scenario, e.g. flush to empty (and fill burst) and fill to 75%]

Now, all said, I tend to agree that we want to leave space for a decent
size burst after a fill. That is why I think that filling to 75% is
reasonable. After an alloc that triggers a fill, I don't want the cache
less than 50% full, but not completely full so there is room for a free
without a flush, and similarly for a free that triggers a flush, the cache
should not be empty, but also should not be more than half full.

One suggestion - we could always add a simple tunable that specifies the
margin, or reserved entries for alloc and free. We can then guide in the
docs that the value should be e.g. "zero for apps where alloc and free take
place on different cores. 20%-50% of cache is recommended where alloc and
free take place on the same core"

/Bruce

Re: [PATCH v5] mempool: improve cache behaviour and performance

Reply via email to