On Thu, Mar 15, 2018 at 02:26:28PM -0700, Alexander Duyck wrote:
> On Thu, Mar 15, 2018 at 12:53 PM, Matthew Wilcox <wi...@infradead.org> wrote:
> > From: Matthew Wilcox <mawil...@microsoft.com>
> >
> > Shrink page_frag_cache from 24 to 8 bytes (a single pointer to the
> > currently-in-use struct page) by using the page's refcount directly
> > (instead of maintaining a bias) and storing our current progress
> > through the page in the same bits currently used for page->index.
> > We no longer need to reflect the page pfmemalloc state if we're
> > storing the page directly.
> >
> > On the downside, we now call page_address() on every allocation, and
> > we do an atomic_inc() rather than a non-atomic decrement, but we
> > should touch the same number of cachelines and there is far less
> > code (and the code is less complex).
> >
> > Signed-off-by: Matthew Wilcox <mawil...@microsoft.com>
>
> So my concern with this patch is that it is essentially trading off
> CPU performance for reduced size. One of the reasons for getting away
> from using the page pointer is because it is expensive to access the
> page when the ref_count is bouncing on multiple cache lines.
I'm not really interested in trading off SMP scalability for reduced
memory usage; if we see more than a trivial penalty then I'll withdraw
this RFC.  I read & understand the commits that you made to get us to
the current codebase, but there's no history of this particular
variation in the tree.

0e39250845c0f91acc64264709b25f7f9b85c2c3
9451980a6646ed487efce04a9df28f450935683e
540eb7bf0bbedb65277d68ab89ae43cdec3fd6ba

I think that by moving the 'offset' into the struct page, we end up
updating only one cacheline instead of two, which should be a
performance win.  I understand your concern about the cacheline
bouncing between the freeing and allocating CPUs.  Is cross-CPU
freeing a frequent occurrence?  From looking at its current usage, it
seemed like the allocation and freeing were usually on the same CPU.

> In addition converting from a page to a virtual address is actually
> more expensive then you would think it should be. I went through and
> optimized that the best I could, but there is still a pretty
> significant cost to the call.

Were you able to break down where that cost occurred?  It looks pretty
cheap on x86-32 nohighmem:

    343e:	a1 00 00 00 00       	mov    0x0,%eax
			343f: R_386_32	mem_map
    3443:	29 f3                	sub    %esi,%ebx
    3445:	89 5a 08             	mov    %ebx,0x8(%edx)
    3448:	29 c2                	sub    %eax,%edx
    344a:	c1 fa 05             	sar    $0x5,%edx
    344d:	c1 e2 0c             	shl    $0xc,%edx
    3450:	8d 84 13 00 00 00 c0 	lea    -0x40000000(%ebx,%edx,1),%eax
    3457:	8b 5d f8             	mov    -0x8(%ebp),%ebx
    345a:	8b 75 fc             	mov    -0x4(%ebp),%esi
    345d:	89 ec                	mov    %ebp,%esp
    345f:	5d                   	pop    %ebp
    3460:	c3                   	ret

I did code up a variant which stored the virtual address of the offset
in page->index, but I threw that away after seeing how much more code
it turned into.  Fortunately I had a better idea of how to implement
that early this morning.
I haven't even tested it yet, but it looks like better generated code:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 28d9a4c9c5fd..ec38204eb903 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4357,8 +4357,8 @@ struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 		unsigned int offset = PAGE_SIZE << compound_order(old);
 
 		if (offset > size) {
-			old->offset = offset;
-			return old;
+			page = old;
+			goto reset;
 		}
 	}
 
@@ -4379,7 +4379,10 @@ struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	if (old)
 		put_page(old);
 	nc->page = page;
-	page->offset = PAGE_SIZE << compound_order(page);
+	page->private = (PAGE_SIZE << compound_order(page)) - 1;
+reset:
+	page->index = (unsigned long)page_to_virt(page) +
+			(PAGE_SIZE << compound_order(page));
 
 	return page;
 }
@@ -4402,20 +4405,26 @@ void *page_frag_alloc(struct page_frag_cache *nc, unsigned int size,
 		gfp_t gfp_mask)
 {
 	struct page *page = nc->page;
-	unsigned int offset = page->offset;
+	unsigned long addr = addr;
+	unsigned int offset = 0;
 
-	if (unlikely(!page || offset < size)) {
+	if (page) {
+		addr = page->index;
+		offset = addr & page->private;
+	}
+
+	if (unlikely(offset < size)) {
 		page = __page_frag_cache_refill(nc, size, gfp_mask);
 		if (!page)
 			return NULL;
-		offset = page->offset;
+		addr = page->index;
 	}
 
 	page_ref_inc(page);
-	offset -= size;
-	page->offset = offset;
+	addr -= size;
+	page->index = addr;
 
-	return page_address(page) + offset;
+	return (void *)addr;
 }
 EXPORT_SYMBOL(page_frag_alloc);

Obviously if the problem turns out to be the cacheline thrashing rather
than the call to page_to_virt, then this is pointless to test.

> I won't be able to test the patches until next week, but I expect I
> will probably see a noticeable regression when performing a small
> packet routing test.

I really appreciate you being willing to try this for me.  I need to
get myself a dual-socket machine to test things like this.