In hot code paths such as mvneta_rx_swbm(), we access fields of rx_desc and tx_desc. These DMA descriptors are allocated by dma_alloc_coherent(), so they are uncacheable if the device isn't cache coherent, and reading from uncached memory is fairly slow.
Patch 1 reuses the status value that has already been read instead of reading the status field of rx_desc again. Patch 2 avoids reading buf_phys_addr from rx_desc again in mvneta_rx_hwbm() by reusing the phys_addr variable. Patch 3 avoids reading from tx_desc as much as possible by storing what we need in local variables. We get the following performance data on Marvell BG4CT platforms (tested with iperf):

before the patch: sending 1GB in mvneta_tx() (TSO disabled) costs 793553760 ns
after the patch:  sending 1GB in mvneta_tx() (TSO disabled) costs 719953800 ns

We saved 9.2% of the time.

Patch 4 uses cacheable memory to store the rx buffer DMA address. We get the following performance data on Marvell BG4CT platforms (tested with iperf):

before the patch: receiving 1GB in mvneta_rx_swbm() costs 1492659600 ns
after the patch:  receiving 1GB in mvneta_rx_swbm() costs 1421565640 ns

We saved 4.76% of the time.

Basically, patch 1 and patch 4 do what Arnd mentioned in [1].

Hi Arnd, I added a "Suggested-by" tag for you, I hope you don't mind ;)

Thanks

[1] https://www.spinics.net/lists/netdev/msg405889.html

Since v2:
  - add Gregory's ack to patch 1
  - only get the rx buffer DMA address from cacheable memory for mvneta_rx_swbm()
  - add patch 2 to read rx_desc->buf_phys_addr once in mvneta_rx_hwbm()
  - add patch 3 to avoid reading from tx_desc as much as possible

Since v1:
  - correct the performance data typo

Jisheng Zhang (4):
  net: mvneta: avoid getting status from rx_desc as much as possible
  net: mvneta: avoid getting buf_phys_addr from rx_desc again
  net: mvneta: avoid reading from tx_desc as much as possible
  net: mvneta: Use cacheable memory to store the rx buffer DMA address

 drivers/net/ethernet/marvell/mvneta.c | 80 +++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 37 deletions(-)

--
2.11.0