> > I am talking about a different thing:
> > I think with some extra effort the driver can use (in some cases)
> > rte_mbuf_raw_free_bulk() even when RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> > is not specified.
> > Let's say we make txq->fast_free_mp[] an array with the same size as
> > txq->txep[].
> > At tx_burst(), when filling txep[], we can do pre_free() checks for that
> > mbuf, and in case of success store its mempool pointer in the
> > corresponding txq->fast_free_mp[], otherwise put NULL there.
> > Then at tx_free() we can scan fast_free_mp[] and invoke raw_free() for
> > non-NULL entries.
> > Again, for now it is just an idea, probably worth thinking about.
>
> Yes, that seems like an excellent idea, certainly worth considering!
>
> At tx_free(), the mbufs might be cold, so not accessing them at this point
> improves performance. (Which is also the point of my patch.)

Yes.

> At tx_burst(), the mbufs are read anyway (their information is written into
> the tx descriptors), so the mbufs are hot in the cache at this point.

Yes.

> Best case with your suggestion, rte_pktmbuf_prefree_seg() doesn't write the
> mbuf, so the performance cost of doing it at tx_burst() is extremely low.

Yes.

> Worst case with your suggestion, rte_pktmbuf_prefree_seg() does write the
> mbuf, so the mbuf write operation simply moves from tx_free() to tx_burst().
> However, in tx_burst(), the mbuf is already hot in the cache, so per
> transmitted mbuf, we get one load+store at tx_burst() instead of
> one load at tx_burst() + one load+store at tx_free().

I suppose you plan to invoke the full rte_pktmbuf_prefree_seg() here?
Unfortunately, I don't think that is possible: when refcnt > 1, we must
decrement refcnt only when we are ready to release the mbuf.
Otherwise we can end up with the NIC HW reading from an already released
(and probably re-used) mbuf.
What we probably need is a lightweight version of rte_pktmbuf_prefree_seg()
that returns a non-NULL value only when refcnt == 1 and the segment is a
direct mbuf (neither indirect nor with external memory attached).
Something like:

static __rte_always_inline const struct rte_mbuf *
rte_pktmbuf_prefree_check(const struct rte_mbuf *m)
{
	/* Read-only: no refcnt decrement and no write to the mbuf. */
	if (rte_mbuf_refcnt_read(m) == 1 && RTE_MBUF_DIRECT(m))
		return m;
	return NULL;
}

So in the worst case (when the check returns NULL) we still need to do
load+store at tx_free().
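
To make the combined idea a bit more concrete, below is a rough,
self-contained sketch of both halves: the read-only check at tx_burst()
and the scan of fast_free_mp[] at tx_free(). It deliberately does not use
the DPDK headers. struct mbuf, struct mempool, struct txq, the F_* flags,
prefree_check(), raw_free() and slow_free() are all hypothetical stand-ins
I made up for illustration (for rte_mbuf, rte_mempool, the driver's Tx
queue, RTE_MBUF_DIRECT(), a raw free to the mempool and
rte_pktmbuf_free_seg() respectively), and a real driver would presumably
batch the fast-path mbufs per mempool instead of freeing them one by one.
It also assumes nobody else takes a reference on an mbuf between
tx_burst() and tx_free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE  8
#define F_INDIRECT (1u << 0)	/* stand-in for the "indirect mbuf" flag */
#define F_EXTERNAL (1u << 1)	/* stand-in for the "external buffer" flag */

/* Hypothetical mempool: only counts what was returned to it. */
struct mempool {
	unsigned int nb_raw_free;	/* returned via the fast path */
	unsigned int nb_slow_free;	/* returned via the normal path */
};

/* Hypothetical, heavily simplified mbuf. */
struct mbuf {
	struct mempool *pool;
	uint16_t refcnt;
	uint32_t flags;
};

/* Hypothetical Tx queue with the extra fast_free_mp[] array. */
struct txq {
	struct mbuf *txep[RING_SIZE];
	struct mempool *fast_free_mp[RING_SIZE];
	unsigned int head;	/* next slot to free */
	unsigned int tail;	/* next slot to fill */
	unsigned int nb_used;
};

/* Read-only check: never decrements refcnt, never writes the mbuf. */
static inline const struct mbuf *
prefree_check(const struct mbuf *m)
{
	if (m->refcnt == 1 && !(m->flags & (F_INDIRECT | F_EXTERNAL)))
		return m;
	return NULL;
}

/* Fast path: the mbuf itself is not dereferenced here. */
static void
raw_free(struct mempool *mp, struct mbuf *m)
{
	(void)m;
	mp->nb_raw_free++;
}

/* Normal path: load+store on the mbuf (detach logic omitted). */
static void
slow_free(struct mbuf *m)
{
	if (--m->refcnt == 0)
		m->pool->nb_slow_free++;
}

static void
tx_burst(struct txq *q, struct mbuf **pkts, unsigned int n)
{
	unsigned int i;

	assert(q->nb_used + n <= RING_SIZE);
	for (i = 0; i < n; i++) {
		unsigned int slot = (q->tail + i) % RING_SIZE;

		/* The mbuf is hot here: descriptor setup reads it anyway. */
		q->txep[slot] = pkts[i];
		q->fast_free_mp[slot] =
			prefree_check(pkts[i]) != NULL ? pkts[i]->pool : NULL;
	}
	q->tail = (q->tail + n) % RING_SIZE;
	q->nb_used += n;
}

static void
tx_free(struct txq *q, unsigned int n)
{
	unsigned int i;

	assert(n <= q->nb_used);
	for (i = 0; i < n; i++) {
		unsigned int slot = (q->head + i) % RING_SIZE;

		if (q->fast_free_mp[slot] != NULL)
			raw_free(q->fast_free_mp[slot], q->txep[slot]);
		else
			slow_free(q->txep[slot]);	/* worst case */
		q->txep[slot] = NULL;
	}
	q->head = (q->head + n) % RING_SIZE;
	q->nb_used -= n;
}
```

In this sketch only the slots where the check failed touch the
(possibly cold) mbuf at tx_free(); all other slots are served from
fast_free_mp[] alone.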