On Thu, Mar 15, 2018 at 02:26:28PM -0700, Alexander Duyck wrote:
> On Thu, Mar 15, 2018 at 12:53 PM, Matthew Wilcox <wi...@infradead.org> wrote:
> > From: Matthew Wilcox <mawil...@microsoft.com>
> >
> > Shrink page_frag_cache from 24 to 8 bytes (a single pointer to the
> > currently-in-use struct page) by using the page's refcount directly
> > (instead of maintaining a bias) and storing our current progress
> > through the page in the same bits currently used for page->index.
> > We no longer need to reflect the page pfmemalloc state if we're
> > storing the page directly.
> >
> > On the downside, we now call page_address() on every allocation, and
> > we do an atomic_inc() rather than a non-atomic decrement, but we
> > should touch the same number of cachelines and there is far less
> > code (and the code is less complex).
> >
> > Signed-off-by: Matthew Wilcox <mawil...@microsoft.com>
>
> So my concern with this patch is that it is essentially trading off
> CPU performance for reduced size. One of the reasons for getting away
> from using the page pointer is because it is expensive to access the
> page when the ref_count is bouncing on multiple cache lines.
I'm not really interested in trading off SMP scalability for reduced
memory usage; if we see more than a trivial penalty then I'll withdraw
this RFC.  I read & understand the commits that you made to get us to
the current codebase, but there's no history of this particular
variation in the tree.

0e39250845c0f91acc64264709b25f7f9b85c2c3
9451980a6646ed487efce04a9df28f450935683e
540eb7bf0bbedb65277d68ab89ae43cdec3fd6ba

I think that by moving the 'offset' into the struct page, we end up
updating only one cacheline instead of two, which should be a
performance win.  I understand your concern about the cacheline
bouncing between the freeing and allocating CPUs.  Is cross-CPU
freeing a frequent occurrence?  From looking at its current usage, it
seemed like the allocation and freeing were usually on the same CPU.

> In addition converting from a page to a virtual address is actually
> more expensive then you would think it should be. I went through and
> optimized that the best I could, but there is still a pretty
> significant cost to the call.

Were you able to break down where that cost occurred?  It looks pretty
cheap on x86-32 nohighmem:

    343e:	a1 00 00 00 00       	mov    0x0,%eax
			343f: R_386_32	mem_map
    3443:	29 f3                	sub    %esi,%ebx
    3445:	89 5a 08             	mov    %ebx,0x8(%edx)
    3448:	29 c2                	sub    %eax,%edx
    344a:	c1 fa 05             	sar    $0x5,%edx
    344d:	c1 e2 0c             	shl    $0xc,%edx
    3450:	8d 84 13 00 00 00 c0 	lea    -0x40000000(%ebx,%edx,1),%eax
    3457:	8b 5d f8             	mov    -0x8(%ebp),%ebx
    345a:	8b 75 fc             	mov    -0x4(%ebp),%esi
    345d:	89 ec                	mov    %ebp,%esp
    345f:	5d                   	pop    %ebp
    3460:	c3                   	ret

I did code up a variant which stored the virtual address of the offset
in page->index, but I threw that away after seeing how much more code
it turned into.  Fortunately I had a better idea of how to implement
that early this morning.
I haven't even tested it yet, but it looks like better generated code:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 28d9a4c9c5fd..ec38204eb903 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4357,8 +4357,8 @@ struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 		unsigned int offset = PAGE_SIZE << compound_order(old);
 
 		if (offset > size) {
-			old->offset = offset;
-			return old;
+			page = old;
+			goto reset;
 		}
 	}
 
@@ -4379,7 +4379,10 @@ struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	if (old)
 		put_page(old);
 	nc->page = page;
-	page->offset = PAGE_SIZE << compound_order(page);
+	page->private = (PAGE_SIZE << compound_order(page)) - 1;
+reset:
+	page->index = (unsigned long)page_to_virt(page) +
+			(PAGE_SIZE << compound_order(page));
 
 	return page;
 }
@@ -4402,20 +4405,26 @@ void *page_frag_alloc(struct page_frag_cache *nc, unsigned int size,
 		gfp_t gfp_mask)
 {
 	struct page *page = nc->page;
-	unsigned int offset = page->offset;
+	unsigned long addr = addr;
+	unsigned int offset = 0;
 
-	if (unlikely(!page || offset < size)) {
+	if (page) {
+		addr = page->index;
+		offset = addr & page->private;
+	}
+
+	if (unlikely(offset < size)) {
 		page = __page_frag_cache_refill(nc, size, gfp_mask);
 		if (!page)
 			return NULL;
-		offset = page->offset;
+		addr = page->index;
 	}
 
 	page_ref_inc(page);
-	offset -= size;
-	page->offset = offset;
+	addr -= size;
+	page->index = addr;
 
-	return page_address(page) + offset;
+	return (void *)addr;
 }
 EXPORT_SYMBOL(page_frag_alloc);

Obviously if the problem turns out to be the cacheline thrashing rather
than the call to page_to_virt, then this is pointless to test.

> I won't be able to test the patches until next week, but I expect I
> will probably see a noticeable regression when performing a small
> packet routing test.

I really appreciate you being willing to try this for me.  I need to
get myself a dual-socket machine to test things like this.