> From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com]
> Sent: Tuesday, 1 July 2025 10.16
[...]

> I am talking about different thing:
> I think with some extra effort driver can use (in some cases)
> rte_mbuf_raw_free_bulk() even when RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> is not specified.
> Let say we can make txq->fast_free_mp[] an array with the same size as
> txq->txep[].
> At tx_burst() when filling txep[] we can do pre_free() checks for that mbuf,
> and in case of success store it's mempool pointer in corresponding
> txq->fast_free_mp[],
> otherwise put NULL there.
> Then at tx_free() we can scan fast_free_mp[] and invoke raw_free() for
> non-NULL entries.
> Again, for now it is just an idea probably worth to think about.

Yes, that seems like an excellent idea, certainly worth considering!

At tx_free(), the mbufs might be cold, so not accessing them at that point improves performance. (Which is also the point of my patch.)
At tx_burst(), the mbufs are read anyway (their information is written into the tx descriptors), so they are hot in the cache at that point.

Best case with your suggestion, rte_pktmbuf_prefree_seg() doesn't write to the mbuf, so the performance cost of doing it at tx_burst() is extremely low.

Worst case with your suggestion, rte_pktmbuf_prefree_seg() does write to the mbuf, so the mbuf write operation simply moves from tx_free() to tx_burst(). However, in tx_burst() the mbuf is already hot in the cache, so per transmitted mbuf we get one load+store at tx_burst(), instead of one load at tx_burst() plus one load+store at tx_free().
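
To make sure we are talking about the same thing, here is a rough sketch of how I read the suggestion. It is plain C with toy stand-ins rather than real DPDK code: struct toy_pool, struct toy_mbuf, struct toy_txq, RING_SIZE and all toy_* functions are hypothetical names invented for illustration, and the prefree check is deliberately simplified (it only accepts direct mbufs with refcnt == 1, unlike the real rte_pktmbuf_prefree_seg()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8 /* hypothetical ring size, for illustration only */

/* Toy stand-ins, NOT the real rte_mempool / rte_mbuf. */
struct toy_pool {
	unsigned int raw_frees; /* counts mbufs returned via the raw path */
};

struct toy_mbuf {
	struct toy_pool *pool;
	struct toy_mbuf *next;
	uint16_t refcnt;
	uint16_t nb_segs;
	int indirect; /* nonzero: attached to another buffer */
};

struct toy_txq {
	struct toy_mbuf *txep[RING_SIZE];
	/* Same size as txep[]: pool pointer if raw free is safe, else NULL. */
	struct toy_pool *fast_free_mp[RING_SIZE];
	unsigned int slow_frees; /* counts mbufs needing the normal free path */
};

/*
 * Simplified prefree check (assumption: only the easy case qualifies).
 * In the best case nothing is written: the stores below happen only
 * when the fields actually need resetting.
 */
static int
toy_prefree_ok(struct toy_mbuf *m)
{
	if (m->refcnt != 1 || m->indirect)
		return 0;
	if (m->next != NULL)
		m->next = NULL;
	if (m->nb_segs != 1)
		m->nb_segs = 1;
	return 1;
}

/* tx_burst side: the mbuf is hot here, so do the check now. */
static void
toy_tx_burst(struct toy_txq *q, unsigned int slot, struct toy_mbuf *m)
{
	assert(slot < RING_SIZE);
	q->txep[slot] = m;
	q->fast_free_mp[slot] = toy_prefree_ok(m) ? m->pool : NULL;
}

/* tx_free side: scan fast_free_mp[] without touching raw-freeable mbufs. */
static void
toy_tx_free(struct toy_txq *q, unsigned int first, unsigned int n)
{
	assert(first + n <= RING_SIZE);
	for (unsigned int i = first; i < first + n; i++) {
		struct toy_pool *mp = q->fast_free_mp[i];

		if (mp != NULL)
			mp->raw_frees++; /* stands in for a raw free to mp */
		else
			q->slow_frees++; /* stands in for the normal free path */
		q->txep[i] = NULL;
		q->fast_free_mp[i] = NULL;
	}
}
```

In a real driver, the tx_free() loop would presumably collect runs of consecutive entries sharing the same mempool pointer and hand each run to rte_mbuf_raw_free_bulk() in one call. The sketch only counts frees, to show that the raw path never dereferences the mbuf at tx_free().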