On Tue, 2017-02-07 at 08:06 -0800, Eric Dumazet wrote:

> 		/*
> 		 * make sure we read the CQE after we read the ownership bit
> 		 */
> 		dma_rmb();
> +		prefetch(frags[0].page);

Note that I would like to instead do a prefetch(frags[1].page)

So I will probably change how ring->rx_info is allocated.

Wasting all that space and forcing vmalloc() is silly:

	tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
					sizeof(struct mlx4_en_rx_alloc));
	ring->rx_info = vzalloc_node(tmp, node);

In most cases, using exactly 12 bytes per slot would allow better
packing.

Only one cpu is using this area, no need to force strange alignments
for the sake of avoiding a multiply!
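
To make the space argument concrete, here is a rough userspace sketch
comparing the two sizing schemes. It is not driver code: the helper
names (roundup_pow2, rx_info_bytes_padded, rx_info_bytes_packed) are
made up for illustration, roundup_pow2() only mimics what the kernel's
roundup_pow_of_two() does, and the element size, fragment count and
ring size are plain parameters rather than values taken from mlx4.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's roundup_pow_of_two() (assumption:
 * same result for n >= 1). */
static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Current scheme: each ring entry gets a power-of-two stride large
 * enough for max_frags elements, so the slot address is a shift. */
static size_t rx_info_bytes_padded(size_t ring_size, size_t max_frags,
				   size_t elem_size)
{
	return ring_size * roundup_pow2(max_frags * elem_size);
}

/* Packed scheme: each ring entry gets exactly used_frags elements,
 * so the slot address costs one multiply. */
static size_t rx_info_bytes_packed(size_t ring_size, size_t used_frags,
				   size_t elem_size)
{
	return ring_size * used_frags * elem_size;
}
```

With a hypothetical 1024-entry ring, 4 max frags and a 24-byte element,
the padded layout takes 128 bytes per entry (131072 bytes total), while
packing two frags per entry takes 48 bytes per entry (49152 bytes).
The only price of the packed layout is the multiply when indexing.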