This patch still doesn't address the issue of M_NOWAIT calls driving free memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just be a non-sleeping version of M_WAITOK, with the M_USE_RESERVE flag used to dig deep.
Sushanth

--- On Mon, 11/12/12, Konstantin Belousov <[email protected]> wrote:

> From: Konstantin Belousov <[email protected]>
> Subject: Re: Memory reserves or lack thereof
> To: [email protected]
> Cc: [email protected], "Sears, Steven" <[email protected]>,
>   "[email protected]" <[email protected]>
> Date: Monday, November 12, 2012, 5:36 AM
>
> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
> > On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov <[email protected]> wrote:
> >
> > > On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
> > > > I have a memory subsystem design question that I'm hoping someone
> > > > can answer.
> > > >
> > > > I've been looking at a machine that is completely out of memory, as in
> > > >
> > > > v_free_count = 0,
> > > > v_cache_count = 0,
> > > >
> > > > I wondered how a machine could completely run out of memory like this,
> > > > especially after finding a lack of interrupt storms or other
> > > > pathologies that would tend to overcommit memory. So I started
> > > > investigating.
> > > >
> > > > Most allocators come down to vm_page_alloc(), which has this guard:
> > > >
> > > > 	if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> > > > 		page_req = VM_ALLOC_SYSTEM;
> > > > 	};
> > > >
> > > > 	if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
> > > > 	    (page_req == VM_ALLOC_SYSTEM &&
> > > > 	    cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
> > > > 	    (page_req == VM_ALLOC_INTERRUPT &&
> > > > 	    cnt.v_free_count + cnt.v_cache_count > 0)) {
> > > >
> > > > The key observation is that if VM_ALLOC_INTERRUPT is set, it will
> > > > allocate every last page.
> > > >
> > > > From the name one might expect VM_ALLOC_INTERRUPT to be somewhat
> > > > rare, perhaps only used from interrupt threads.
> > > > Not so, see kmem_malloc() or uma_small_alloc(), which both contain
> > > > this mapping:
> > > >
> > > > 	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> > > > 		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> > > > 	else
> > > > 		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> > > >
> > > > Note that M_USE_RESERVE has been deprecated and is used in just a
> > > > handful of places. Also note that lots of code paths come through
> > > > these routines.
> > > >
> > > > What this means is that essentially _any_ allocation using M_NOWAIT
> > > > will bypass whatever reserves have been held back and will take
> > > > every last page available.
> > > >
> > > > There is no documentation stating that M_NOWAIT has this side effect
> > > > of essentially being privileged, so any innocuous piece of code that
> > > > can't block will use it. And of course M_NOWAIT is literally used
> > > > all over.
> > > >
> > > > It looks to me like the design goal of the BSD allocators is
> > > > recovery; they will give all pages away, knowing they can recover.
> > > >
> > > > Am I missing anything? I would have expected some small number of
> > > > pages to be held in reserve just in case. And I didn't expect
> > > > M_NOWAIT to be a sort of back door for grabbing memory.
> > >
> > > Your analysis is right, there is nothing to add or correct.
> > > This is the reason to strongly prefer M_WAITOK.
> >
> > Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was
> > well understood that it should only be used by interrupt handlers.
> >
> > The trouble is that M_NOWAIT conflates two orthogonal things. The
> > obvious one being that the allocation shouldn't sleep. The other being
> > how far we're willing to deplete the cache/free page queues.
> >
> > When fine-grained locking got sprinkled throughout the kernel, we all
> > too often found ourselves wanting to do allocations without the
> > possibility of blocking.
> > So, M_NOWAIT became commonplace, where it wasn't before.
> >
> > This had the unintended consequence of introducing a lot of memory
> > allocations in the top half of the kernel, i.e., non-interrupt handling
> > code, that were digging deep into the cache/free page queues.
> >
> > Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
> > allocation is less likely to succeed than an "M_NOWAIT" allocation.
> > However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached
> > page; it could only allocate a free page. M_USE_RESERVE said that it
> > was ok to allocate a cached page even though M_NOWAIT was specified.
> > Consequently, the system wouldn't dig as far into the free page queue
> > if M_USE_RESERVE was specified, because it was allowed to reclaim a
> > cached page.
> >
> > In conclusion, I think it's time that we change M_NOWAIT so that it
> > doesn't dig any deeper into the cache/free page queues than M_WAITOK
> > does, and reintroduce an M_USE_RESERVE-like flag that says to dig deep
> > into the cache/free page queues. The trouble is that we then need to
> > identify all of those places that are implicitly depending on the
> > current behavior of M_NOWAIT also digging deep into the cache/free page
> > queues, so that we can add an explicit M_USE_RESERVE.
> >
> > Alan
> >
> > P.S. I suspect that we should also increase the size of the "page
> > reserve" that is kept for VM_ALLOC_INTERRUPT allocations in
> > vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like
> > flag in today's kernel could actually be satisfied by two pages?
>
> I am almost sure that most of the people who use the M_NOWAIT flag do
> not know about the 'allow a deeper drain of the free queue' effect. As
> such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE.
> My only expectation of problematic places would be in the swapout path.
>
> I found a single explicit use of M_USE_RESERVE in the kernel, so the
> flip is relatively simple.
>
> Below is the patch, which I only compile-tested on amd64, and which
> booted fine.
>
> Peter, could you, please, give it a run to see obvious deadlocks, if any?
>
> diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
> index dc9c307..ab1e869 100644
> --- a/sys/amd64/amd64/uma_machdep.c
> +++ b/sys/amd64/amd64/uma_machdep.c
> @@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
>  
>  #include <sys/param.h>
>  #include <sys/lock.h>
> +#include <sys/malloc.h>
>  #include <sys/mutex.h>
>  #include <sys/systm.h>
>  #include <vm/vm.h>
> @@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	int pflags;
>  
>  	*flags = UMA_SLAB_PRIV;
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags);
>  		if (m == NULL) {
> diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
> index f60cdb1..75366e3 100644
> --- a/sys/arm/arm/vm_machdep.c
> +++ b/sys/arm/arm/vm_machdep.c
> @@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  		ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
>  		return (ret);
>  	}
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>  		if (m == NULL) {
> diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
> index 71caa29..2ce1ca6 100644
> --- a/sys/fs/devfs/devfs_devs.c
> +++ b/sys/fs/devfs/devfs_devs.c
> @@ -121,7 +121,7 @@ devfs_alloc(int flags)
>  	struct cdev *cdev;
>  	struct timespec ts;
>  
> -	cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
> +	cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
>  	    ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
>  	if (cdp == NULL)
>  		return (NULL);
> diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
> index 37353ff..9f77762 100644
> --- a/sys/ia64/ia64/uma_machdep.c
> +++ b/sys/ia64/ia64/uma_machdep.c
> @@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	int pflags;
>  
>  	*flags = UMA_SLAB_PRIV;
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
> index 798e632..24baef0 100644
> --- a/sys/mips/mips/uma_machdep.c
> +++ b/sys/mips/mips/uma_machdep.c
> @@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	void *va;
>  
>  	*flags = UMA_SLAB_PRIV;
> -
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT;
> -	else
> -		pflags = VM_ALLOC_SYSTEM;
> +	pflags = m2vm_flags(wait, 0);
>  
>  	for (;;) {
>  		m = pmap_alloc_direct_page(0, pflags);
> diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
> index a491680..3e320b9 100644
> --- a/sys/powerpc/aim/mmu_oea64.c
> +++ b/sys/powerpc/aim/mmu_oea64.c
> @@ -1369,12 +1369,7 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	*flags = UMA_SLAB_PRIV;
>  	needed_lock = !PMAP_LOCKED(kernel_pmap);
>  
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/powerpc/aim/slb.c b/sys/powerpc/aim/slb.c
> index 162c7fb..3882bfa 100644
> --- a/sys/powerpc/aim/slb.c
> +++ b/sys/powerpc/aim/slb.c
> @@ -483,12 +483,7 @@ slb_uma_real_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	realmax = platform_real_maxaddr();
>  
>  	*flags = UMA_SLAB_PRIV;
> -	if ((wait & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc_contig(NULL, 0, pflags, 1, 0, realmax,
> diff --git a/sys/powerpc/aim/uma_machdep.c b/sys/powerpc/aim/uma_machdep.c
> index 39deb43..23a333f 100644
> --- a/sys/powerpc/aim/uma_machdep.c
> +++ b/sys/powerpc/aim/uma_machdep.c
> @@ -56,12 +56,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	int pflags;
>  
>  	*flags = UMA_SLAB_PRIV;
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/sparc64/sparc64/vm_machdep.c b/sys/sparc64/sparc64/vm_machdep.c
> index cdb94c7..573ab3a 100644
> --- a/sys/sparc64/sparc64/vm_machdep.c
> +++ b/sys/sparc64/sparc64/vm_machdep.c
> @@ -501,14 +501,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	PMAP_STATS_INC(uma_nsmall_alloc);
>  
>  	*flags = UMA_SLAB_PRIV;
> -
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/vm/vm_kern.c b/sys/vm/vm_kern.c
> index 46e7f1c..e4c3704 100644
> --- a/sys/vm/vm_kern.c
> +++ b/sys/vm/vm_kern.c
> @@ -222,12 +222,7 @@ kmem_alloc_attr(vm_map_t map, vm_size_t size, int flags, vm_paddr_t low,
>  	vm_object_reference(object);
>  	vm_map_insert(map, object, offset, addr, addr + size, VM_PROT_ALL,
>  	    VM_PROT_ALL, 0);
> -	if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
> -	if (flags & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(flags, VM_ALLOC_NOBUSY);
>  	VM_OBJECT_LOCK(object);
>  	end_offset = offset + size;
>  	for (; offset < end_offset; offset += PAGE_SIZE) {
> @@ -296,14 +291,7 @@ kmem_alloc_contig(vm_map_t map, vm_size_t size, int flags, vm_paddr_t low,
>  	vm_object_reference(object);
>  	vm_map_insert(map, object, offset, addr, addr + size, VM_PROT_ALL,
>  	    VM_PROT_ALL, 0);
> -	if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
> -	if (flags & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> -	if (flags & M_NODUMP)
> -		pflags |= VM_ALLOC_NODUMP;
> +	pflags = m2vm_flags(flags, VM_ALLOC_NOBUSY);
>  	VM_OBJECT_LOCK(object);
>  	tries = 0;
>  retry:
> @@ -487,11 +475,7 @@ kmem_back(vm_map_t map, vm_offset_t addr, vm_size_t size, int flags)
>  	    entry->wired_count == 0 && (entry->eflags & MAP_ENTRY_IN_TRANSITION)
>  	    == 0, ("kmem_back: entry not found or misaligned"));
>  
> -	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -
> +	pflags = m2vm_flags(flags, VM_ALLOC_WIRED);
>  	if (flags & M_ZERO)
>  		pflags |= VM_ALLOC_ZERO;
>  	if (flags & M_NODUMP)
> diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
> index 70b8416..0286a6d 100644
> --- a/sys/vm/vm_page.h
> +++ b/sys/vm/vm_page.h
> @@ -344,6 +344,24 @@ extern struct mtx_padalign vm_page_queue_mtx;
>  #define	VM_ALLOC_COUNT_SHIFT	16
>  #define	VM_ALLOC_COUNT(count)	((count) << VM_ALLOC_COUNT_SHIFT)
>  
> +#ifdef M_NOWAIT
> +static inline int
> +m2vm_flags(int malloc_flags, int alloc_flags)
> +{
> +	int pflags;
> +
> +	if ((malloc_flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
> +		pflags = VM_ALLOC_SYSTEM | alloc_flags;
> +	else
> +		pflags = VM_ALLOC_INTERRUPT | alloc_flags;
> +	if (malloc_flags & M_ZERO)
> +		pflags |= VM_ALLOC_ZERO;
> +	if (malloc_flags & M_NODUMP)
> +		pflags |= VM_ALLOC_NODUMP;
> +	return (pflags);
> +}
> +#endif
> +
>  void vm_page_busy(vm_page_t m);
>  void vm_page_flash(vm_page_t m);
>  void vm_page_io_start(vm_page_t m);

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[email protected]"

