On Tue, 2017-03-14 at 21:06 -0700, Alexei Starovoitov wrote:
> On Tue, Mar 14, 2017 at 08:11:43AM -0700, Eric Dumazet wrote:
> > +static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
> > +				    struct mlx4_en_rx_ring *ring,
> > +				    dma_addr_t *dma,
> > +				    unsigned int node, gfp_t gfp)
> > +{
> > +	if (unlikely(!ring->pre_allocated_count)) {
> > +		unsigned int order = READ_ONCE(ring->rx_alloc_order);
> > +
> > +		page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
> > +						__GFP_NOMEMALLOC |
> > +						__GFP_NOWARN |
> > +						__GFP_NORETRY,
> > +					  order);
> > +		if (page) {
> > +			split_page(page, order);
> > +			ring->pre_allocated_count = 1U << order;
> > +		} else {
> > +			if (order > 1)
> > +				ring->rx_alloc_order--;
> > +			page = __alloc_pages_node(node, gfp, 0);
> > +			if (unlikely(!page))
> > +				return NULL;
> > +			ring->pre_allocated_count = 1U;
> > +		}
> > +		ring->pre_allocated = page;
> > +		ring->rx_alloc_pages += ring->pre_allocated_count;
> > +	}
> > +	page = ring->pre_allocated++;
> > +	ring->pre_allocated_count--;
> 
> do you think this style of allocation can be moved into net common?
> If it's a good thing then other drivers should be using it too, right?

Yes, we might do this once this proves to work well.
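
If it does, the reusable core would be quite small. A rough sketch of what
such a generic helper could look like (completely untested, structure and
names invented here just for illustration, this is not an existing API):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Same idea as mlx4_alloc_page() above: opportunistically grab one
 * high-order chunk, split it into order-0 pages and hand them out one
 * by one; fall back to a plain order-0 allocation (and lower the
 * attempted order) when the zone is fragmented.
 */
struct page_frag_pool {
	struct page	*pre_allocated;		/* next page to hand out */
	unsigned int	pre_allocated_count;	/* pages left in the chunk */
	unsigned int	alloc_order;		/* current attempted order */
};

static struct page *page_frag_pool_get(struct page_frag_pool *pool,
				       int node, gfp_t gfp)
{
	struct page *page;

	if (unlikely(!pool->pre_allocated_count)) {
		unsigned int order = READ_ONCE(pool->alloc_order);

		/* No direct reclaim, no retry: fail fast instead of
		 * stalling the caller when high orders are not available.
		 */
		page = __alloc_pages_node(node,
					  (gfp & ~__GFP_DIRECT_RECLAIM) |
					  __GFP_NOMEMALLOC |
					  __GFP_NOWARN |
					  __GFP_NORETRY, order);
		if (page) {
			split_page(page, order);
			pool->pre_allocated_count = 1U << order;
		} else {
			if (order > 1)
				pool->alloc_order = order - 1;
			page = __alloc_pages_node(node, gfp, 0);
			if (unlikely(!page))
				return NULL;
			pool->pre_allocated_count = 1U;
		}
		pool->pre_allocated = page;
	}
	page = pool->pre_allocated++;
	pool->pre_allocated_count--;
	return page;
}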
> 
> > +	ring->cons = 0;
> > +	ring->prod = 0;
> > +
> > +	/* Page recycling works best if we have enough pages in the pool.
> > +	 * Apply a factor of two on the minimal allocations required to
> > +	 * populate RX rings.
> > +	 */
> 
> i'm not sure how above comments matches the code below.
> If my math is correct a ring of 1k elements will ask for 1024
> contiguous pages.

On x86 it might be the case, unless you use MTU=900?

On PowerPC, PAGE_SIZE=65536 and 65536/1536 = 42, so for 1024 elements
we need 1024/42 = ~24 pages.

Thus allocating 48 pages is the goal, rounded up to the next power of
two (64). Nothing in my code really needs this additional constraint;
a page pool does not have to have a power-of-two number of entries.

Later, we might choose a different factor, maybe an elastic one.

> 
> > +retry:
> > +	total = 0;
> > +	pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
> > +	pages_per_ring = roundup_pow_of_two(pages_per_ring);
> > +
> > +	order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);
> 
> if you're sure it doesn't hurt the rest of the system,
> why use MAX_ORDER - 1? why not MAX_ORDER?

alloc_page(GFP_..., MAX_ORDER) never succeeds ;)

Because of __GFP_NOWARN you would not even see the error, I guess, but
we would just get one pesky order-0 page in the ring buffer ;)

> 
> > 
> > -/* We recover from out of memory by scheduling our napi poll
> > - * function (mlx4_en_process_cq), which tries to allocate
> > - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> > +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> > + * to very small values. Periodically increase t to initial value for
> > + * optimal allocations, in case stress is over.
> >   */
> > +	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> > +		ring = priv->rx_ring[ring_ind];
> > +		order = min_t(unsigned int, ring->rx_alloc_order + 1,
> > +			      ring->rx_pref_alloc_order);
> > +		WRITE_ONCE(ring->rx_alloc_order, order);
> 
> when recycling is effective in a matter of few seconds it will
> increase ther order back to 10 and the first time the driver needs
> to allocate, it will start that tedious failure loop all over again.
> How about removing this periodic mlx4_en_recover_from_oom() completely
> and switch to increase the order inside mlx4_alloc_page().
> Like N successful __alloc_pages_node() with order X will bump it
> into order X+1. If it fails next time it will do only one failed attempt.

I wanted to do the increase out of line (not in the data path).

We probably could increase the order only if ring->rx_alloc_pages grew
significantly since the last mlx4_en_recover_from_oom() call.

(That would require a new ring->prior_rx_alloc_pages field, kept out of
the hot cache lines.)

Or maybe attempt the allocation from process context (from
mlx4_en_recover_from_oom()).

I have not thought very hard about this yet, since the data path, if it
really needs new pages, will very quickly probe rx_alloc_order back down
to an order the zone can actually satisfy.
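
To make that first idea concrete, something like the following helper,
called once per ring from mlx4_en_recover_from_oom(), is what I had in
mind (completely untested; ring->prior_rx_alloc_pages does not exist in
this patch, and the threshold is arbitrary):

/* Only raise rx_alloc_order if the ring actually had to allocate a
 * meaningful number of pages since the last periodic check, i.e.
 * recycling alone was not enough.
 */
static void mlx4_en_maybe_raise_alloc_order(struct mlx4_en_rx_ring *ring)
{
	unsigned long delta = ring->rx_alloc_pages -
			      ring->prior_rx_alloc_pages;	/* new field */

	ring->prior_rx_alloc_pages = ring->rx_alloc_pages;

	/* Recycling was effective: do not re-arm the expensive
	 * high-order attempts for nothing. Threshold is arbitrary here.
	 */
	if (delta < 16)
		return;

	WRITE_ONCE(ring->rx_alloc_order,
		   min_t(unsigned int, ring->rx_alloc_order + 1,
			 ring->rx_pref_alloc_order));
}
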
> 
> > +static bool mlx4_replenish(struct mlx4_en_priv *priv,
> > +			   struct mlx4_en_rx_ring *ring,
> > +			   struct mlx4_en_frag_info *frag_info)
> > +{
> > +	struct mlx4_en_page *en_page = &ring->pool.array[ring->pool.pool_idx];
> > +	if (!mlx4_page_is_reusable(en_page->page)) {
> > +		page = mlx4_alloc_page(priv, ring, &dma, numa_mem_id(),
> > +				       GFP_ATOMIC | __GFP_MEMALLOC);
> 
> I don't understand why page_is_reusable is doing !page_is_pfmemalloc(page))
> check, but here you're asking for MEMALLOC pages too, so
> under memory pressure the hw will dma the packet into this page,
> but then the stack will still drop it, so under pressure
> we'll keep alloc/free the pages from reserve. Isn't it better
> to let the hw drop (since we cannot alloc and rx ring is empty) ?
> What am I missing?

We do not want to drop packets that might help swap-over-NFS free
memory and help us recover from the pressure.

This is the whole point of the __GFP_MEMALLOC protocol: the hw cannot
know that a particular TCP packet must not be dropped.

> 
> > @@ -767,10 +820,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >  		case XDP_PASS:
> >  			break;
> >  		case XDP_TX:
> > +			/* Make sure we have one page ready to replace this one */
> > +			npage = NULL;
> > +			if (!ring->page_cache.index) {
> > +				npage = mlx4_alloc_page(priv, ring, &ndma,
> > +							numa_mem_id(),
> > +							GFP_ATOMIC | __GFP_MEMALLOC);
> > +				if (!npage) {
> > +					ring->xdp_drop++;
> > +					goto next;
> > +				}
> > +			}
> 
> I think the approach to drop packets after xdp prog executed is incorrect.
> In such case xdp program already parsed the packet, performed all necessary
> changes and now ready to xmit it out. Dropping packet so late, just
> because page_cache is empty, doesn't make any sense. Therefore please change
> this part back to how it was before: allocate buffer at the end of the loop.
> I think you're trying too hard to avoid _oom() callback doing allocs. Say
> all rx packets went via xdp_tx and we don't have any pages to put into rx ring
> at the end of this rx loop. That's still not a problem. The hw will drop
> few packets and when tx completion fires on the same cpu for xdp rings we
> can without locks populate corresponding rx ring.
> So XDP will be able to operate without allocations even when there is
> no free memory in the rest of the system.
> 
> Also I was thinking to limit xdp rings to order 0 allocs only, but doing
> large order can indeed decrease tlb misses. That has to be tested more
> thoroughly and if large order will turn out to hurt we can add small patch
> on top of this one to do order0 for xdp rings.
> 
> Overall I think there are several good ideas in this patch,
> but drop after xdp_tx is the showstopper and has to be addressed
> before this patch can be merged.
> 

The xmit itself can drop the packet if the TX ring is full.

I can do the change, but you are focusing on an allocation that will
only ever happen for the first packets going through XDP_TX.

I am fine adding the overhead for the non-XDP_TX case, since you
really want it.
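
To be concrete about what that change looks like: the XDP_TX case goes
back to not allocating anything, and the refill after the rx loop simply
stops early when mlx4_alloc_page() fails. Roughly (untested sketch,
simplified, helper names as in the existing driver):

static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
				      struct mlx4_en_rx_ring *ring)
{
	u32 missing = ring->actual_size - (ring->prod - ring->cons);

	while (missing--) {
		/* Try to repopulate one consumed slot; on failure just
		 * stop, leaving the remaining slots empty.
		 */
		if (mlx4_en_prepare_rx_desc(priv, ring,
					    ring->prod & ring->size_mask,
					    GFP_ATOMIC | __GFP_MEMALLOC))
			break;
		ring->prod++;
	}
	mlx4_en_update_rx_prod_db(ring);
}

If the pool is empty, the ring just stays partially populated and the hw
drops until a later refill (or the XDP TX completion path, as you
describe) succeeds.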