On 14/03/2017 5:11 PM, Eric Dumazet wrote:
When adding order-0 page allocations and page recycling in the receive path, I added issues on PowerPC, or more generally on arches with large pages.

A GRO packet, aggregating 45 segments, ended up using 45 page frags on 45 different pages. Before my changes we were very likely packing up to 42 Ethernet frames per 64KB page.

1) At skb freeing time, all put_page() calls on the skb frags now touch 45 different 'struct page', and this adds more cache line misses. Too bad that standard Ethernet MTU is so small :/

2) Using one order-0 page per ring slot consumes ~42 times more memory on PowerPC.

3) Allocating order-0 pages is very likely to use pages from very different locations, increasing TLB pressure on hosts with more than 256 GB of memory after days of uptime.

This patch uses a refined strategy addressing these points. We still use order-0 pages, but the page recycling technique is modified so that we have better chances of lowering the number of pages containing the frags of a given GRO skb (a factor of 2 on x86, and 21 on PowerPC).

Page allocations are split into two halves:
- One currently visible to the NIC for DMA operations.
- The other containing pages that were already added to old skbs, put in a quarantine.

When we receive a frame, we look at the oldest entry in the pool and check if the page count is back to one, meaning old skbs/frags were consumed and the page can be recycled.

Page allocations are attempted using high orders, trying to lower TLB pressure. We remember in ring->rx_alloc_order the last attempted order and quickly decrement it in case of failure. Then mlx4_en_recover_from_oom(), called every 250 msec, will attempt to gradually restore rx_alloc_order to its optimal value.

On x86, memory allocations stay the same (one page per RX slot for MTU=1500). But on PowerPC, this patch considerably reduces the allocated memory.

Performance gain on PowerPC is about 50% for a single TCP flow.
On x86, I could not measure the difference, my test machine being limited by the sender (33 Gbit per TCP flow). 22 fewer cache line misses per 64 KB GRO packet is probably on the order of 2% or so.

Signed-off-by: Eric Dumazet <eduma...@google.com>
Cc: Tariq Toukan <tar...@mellanox.com>
Cc: Saeed Mahameed <sae...@mellanox.com>
Cc: Alexander Duyck <alexander.du...@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 470 ++++++++++++++++-----------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 ++-
 3 files changed, 317 insertions(+), 222 deletions(-)
Hi Eric,

Thanks for your patch.
I will do the XDP tests and complete the review by tomorrow.

Regards,
Tariq