> Date: Tue, 22 Nov 2022 15:08:30 +0100
> From: Martin Pieuchot <m...@openbsd.org>
> 
> On 18/11/22(Fri) 21:33, Mark Kettenis wrote:
> > > Date: Thu, 17 Nov 2022 20:23:37 +0100
> > > From: Mark Kettenis <mark.kette...@xs4all.nl>
> > > 
> > > > From: Jeremie Courreges-Anglas <j...@wxcvbn.org>
> > > > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > > > 
> > > > On Tue, Nov 15 2022, Martin Pieuchot <m...@openbsd.org> wrote:
> > > > > UVM vnode objects include a reference count to keep track of the
> > > > > number of processes that have the corresponding pages mapped in
> > > > > their VM space.
> > > > >
> > > > > When the last process referencing a given library or executable
> > > > > dies, the reaper will munmap this object on its behalf.  When this
> > > > > happens it doesn't free the associated pages, to speed up possible
> > > > > re-use of the file.  Instead the pages are placed on the inactive
> > > > > list but stay ready to be pmap_enter()'d without requiring I/O as
> > > > > soon as a new process needs to access them.
> > > > >
> > > > > The mechanism to keep pages populated, known as
> > > > > UVM_VNODE_CANPERSIST, doesn't work well with swapping [0].  For
> > > > > some reason, when the page daemon wants to free pages on the
> > > > > inactive list it tries to flush the pages to disk and calls
> > > > > panic(9) because it needs a valid reference to the vnode to do so.
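
To make the above concrete: on the last detach the persisting object is
not torn down, its pages are merely deactivated.  Roughly, the persist
path boils down to the following (paraphrasing uvm_vnode.c from memory;
details elided and possibly imprecise):

	if (--uobj->uo_refs > 0)
		return;			/* still mapped somewhere */

	if (uvn->u_flags & UVM_VNODE_CANPERSIST) {
		/*
		 * Last mapping is gone, but keep the object and its
		 * pages: deactivate them so they end up on the inactive
		 * list, ready to be pmap_enter()'d again without I/O the
		 * next time the file is mapped or executed.
		 */
		uvn_flush(uobj, 0, 0, PGO_DEACTIVATE | PGO_ALLPAGES);
		vrele(uvn->u_vnode);	/* drop our vnode reference */
		return;
	}

	/* non-persisting case: terminate the object and free its pages */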
> > > > >
> > > > > This indicates that the mechanism described above, which seems to work
> > > > > fine for RO mappings, is currently buggy in more complex situations.
> > > > > Flushing the pages when the last reference of the UVM object is
> > > > > dropped also doesn't seem to be enough as bluhm@ reported [1].
> > > > >
> > > > > The diff below, which has already been committed and reverted,
> > > > > gets rid of the UVM_VNODE_CANPERSIST logic.  I'd like to commit it
> > > > > again now that the arm64 caching bug has been found and fixed.
> > > > >
> > > > > Getting rid of this logic means more I/O will be generated and
> > > > > pages might have a faster reuse cycle.  I'm aware this might
> > > > > introduce a small slowdown,
> > > > 
> > > > Numbers for my usual make -j4 in libc,
> > > > on an Unmatched riscv64 box, before the diff:
> > > >    16m32.65s real    21m36.79s user    30m53.45s system
> > > >    16m32.37s real    21m33.40s user    31m17.98s system
> > > >    16m32.63s real    21m35.74s user    31m12.01s system
> > > >    16m32.13s real    21m36.12s user    31m06.92s system
> > > > After:
> > > >    19m14.15s real    21m09.39s user    36m51.33s system
> > > >    19m19.11s real    21m02.61s user    36m58.46s system
> > > >    19m21.77s real    21m09.23s user    37m03.85s system
> > > >    19m09.39s real    21m08.96s user    36m36.00s system
> > > > 
> > > > 4 cores amd64 VM, before (-current plus another diff):
> > > >    1m54.31s real     2m47.36s user     4m24.70s system
> > > >    1m52.64s real     2m45.68s user     4m23.46s system
> > > >    1m53.47s real     2m43.59s user     4m27.60s system
> > > > After:
> > > >    2m34.12s real     2m51.15s user     6m20.91s system
> > > >    2m34.30s real     2m48.48s user     6m23.34s system
> > > >    2m37.07s real     2m49.60s user     6m31.53s system
> > > > 
> > > > > however I believe we should work towards loading files from the
> > > > > buffer cache to save I/O cycles instead of having another layer of
> > > > > cache.  Such work isn't trivial and making sure the vnode <-> UVM
> > > > > relation is simple and well understood is the first step in this
> > > > > direction.
> > > > >
> > > > > I'd appreciate it if the diff below could be tested on many
> > > > > architectures, including the offending rpi4.
> > > > 
> > > > Mike has already tested a make build on a riscv64 Unmatched.  I have
> > > > also run regress in sys, lib/libc and lib/libpthread on that arch.  As
> > > > far as I can see this looks stable on my machine, but what I really care
> > > > about is the riscv64 bulk build cluster (I'm going to start another
> > > > bulk build soon).
> > > > 
> > > > > Comments?  Oks?
> > > > 
> > > > The performance drop in my microbenchmark kinda worries me but it's only
> > > > a microbenchmark...
> > > 
> > > I wouldn't call this a microbenchmark.  I fear this is typical for
> > > builds of anything on clang architectures.  And I expect it to be
> > > worse on single-processor machines where *every* time we execute clang
> > > or lld all the pages are thrown away upon exit and will need to be
> > > read back from disk again for the next bit of C code we compile.
> > > 
> > > Having mmap'ed reads go through the buffer cache should indeed help
> > > here, but that smells like UBC and even if it isn't, it is something
> > > we don't have and will be tricky to get right.
> > > 
> > > The discussions we had during h2k22 made me understand this thing a
> > > bit better.  Is there any reason why we can't keep a reference to the
> > > vnode and have the pagedaemon drop it when it drops the pages?  Is
> > > there a chance that we run out of vnodes this way?  I vaguely remember
> > > beck@ saying that we can have tons of vnodes now.
> > 
> > Maybe that isn't the best idea.  So here is what may be an actual fix
> > for the "macppc panic: vref used where vget required" problem we're
> > seeing that prompted us to look at UVM_VNODE_CANPERSIST.
> > 
> > If we look back at the race that bluhm@ posted:
> > 
> > panic: vref used where vget required
> > Stopped at      db_enter+0x24:  lwz r11,12(r1)
> >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> >  192060  78628     21         0x2          0    0  c++
> > *132472  74971      0     0x14000      0x200    1K pagedaemon
> > db_enter() at db_enter+0x20
> > panic(91373c) at panic+0x158
> > vref(23b8fa20) at vref+0xac
> > uvm_vnp_uncache(e7eb7c50) at uvm_vnp_uncache+0x88
> > ffs_write(7ed2e423) at ffs_write+0x3b0
> > VOP_WRITE(23b8fa20,e7eb7c50,40,1ff3f60) at VOP_WRITE+0x48
> > uvn_io(7ed2e423,a93d0c,a93814,ffffffff,e4010000) at uvn_io+0x264
> > uvn_put(414b673a,e7eb7dd4,24f00070,5326e90) at uvn_put+0x64
> > uvm_pager_put(0,0,e7eb7d70,6ee0b8,2000000,80000000,0) at uvm_pager_put+0x15c
> > uvmpd_scan_inactive(0) at uvmpd_scan_inactive+0x224
> > uvmpd_scan() at uvmpd_scan+0x158
> > uvm_pageout(7e932633) at uvm_pageout+0x398
> > fork_trampoline() at fork_trampoline+0x14
> > end trace frame: 0x0, count: 2
> > 
> > I don't think there is anything complicated going on.  The pagedaemon
> > is paging out a page associated with a persisting vnode.  That page
> > isn't mapped anymore so nobody holds a reference to the vnode anymore
> > and it sits happily on the freelist to either be reused or recycled.
> > 
> > The code responsible for paging out this page is uvn_put(), which
> > calls uvn_io() and eventually ffs_write() *without* holding a
> > reference to the vnode.  I believe that is wrong; ffs_write() clearly
> > assumes that its caller holds a reference to the vnode.  But it doesn't
> > check until we call uvm_vnp_uncache(), which attempts to grab another
> > reference, which triggers the diagnostic panic we're seeing.
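
For reference, the panic string comes from the DIAGNOSTIC check in
vref(); in kern/vfs_subr.c it is essentially this (paraphrased):

void
vref(struct vnode *vp)
{
#ifdef DIAGNOSTIC
	if (vp->v_usecount == 0)
		panic("vref used where vget required");
#endif
	vp->v_usecount++;
}

vref() is only valid on a vnode that is already referenced; a vnode
whose usecount has dropped to zero may sit on the free list or be in
the middle of being recycled, and has to be re-acquired with vget().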
> > 
> > So I think uvn_put() should make sure it holds a reference to the
> > vnode.  And since it won't hold one if called by the pagedaemon, this
> > means we should use vget() to get one.  Now there may be a risk of a
> > deadlock here.  The pagedaemon may be trying to page out a page that
> > is associated with a vnode that is already in the process of being
> > recycled.  But cleaning the vnode as part of the recycling process
> > requires the vmobjlock that we're holding.  So use LK_NOWAIT and trust
> > that the page will be written to disk and freed as the associated
> > vnode gets cleaned.  That means the pagedaemon can simply skip the
> > page, so we return VM_PAGER_PEND.
> > 
> > Thoughts?  If this works, I probably should add some comments
> > explaining what's going on here.
> 
> I like the idea and the diff.
> 
> I don't like the use of VM_PAGER_PEND.  This return value is used to
> indicate that I/O has been started asynchronously.  This currently never
> happens with vnodes, so we're introducing a new tricky error code path.
> 
> Plus the page hasn't been flushed to disk when uvm_pager_put() returns.
> I'd rather return VM_PAGER_AGAIN, which would leave the page on the
> queue; once the page daemon unlocks `vmobjlock' the racing thread will
> be able to proceed.

Hmm, I did consider returning VM_PAGER_AGAIN but switched to
VM_PAGER_PEND after convincing myself that this would loop inside
uvm_pager_put() without ever dropping the vmobjlock.  But looking at
the code again, that isn't the case: it retries once on a single page
after dropping the cluster, but then it returns VM_PAGER_AGAIN.  So
yes, that should be better.
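
So, schematically, the difference on the pagedaemon side is (a
simplified illustration of the two return values, not the literal
uvmpd_scan_inactive() code):

switch (result) {	/* what uvn_put() handed back via uvm_pager_put() */
case VM_PAGER_PEND:
	/*
	 * Async I/O really is in flight: the pages stay busy and the
	 * I/O completion path will clean and free them, so the
	 * pagedaemon just moves on.
	 */
	break;
case VM_PAGER_AGAIN:
	/*
	 * Nothing was written (here: vget(LK_NOWAIT) failed).  The page
	 * stays on the inactive queue, and once the pagedaemon drops the
	 * vmobjlock the thread recycling the vnode can make progress and
	 * clean the page itself.
	 */
	break;
}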

Here is a diff.  Maybe bluhm@ can try this on the macppc machine that
triggered the original "vref used where vget required" problem?


Index: uvm/uvm_vnode.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.130
diff -u -p -r1.130 uvm_vnode.c
--- uvm/uvm_vnode.c     20 Oct 2022 13:31:52 -0000      1.130
+++ uvm/uvm_vnode.c     22 Nov 2022 14:37:25 -0000
@@ -899,11 +899,18 @@ uvn_cluster(struct uvm_object *uobj, vof
 int
 uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
 {
+       struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
        int retval;
 
        KASSERT(rw_write_held(uobj->vmobjlock));
 
+       KASSERT(uvn->u_flags & UVM_VNODE_VALID);
+       if (vget(uvn->u_vnode, LK_NOWAIT))
+               return VM_PAGER_AGAIN;
+
        retval = uvn_io((struct uvm_vnode*)uobj, pps, npages, flags, UIO_WRITE);
+
+       vrele(uvn->u_vnode);
 
        return retval;
 }
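
For the record, here is how the patched function reads with the
reasoning above spelled out as comments (same code as in the diff; the
comments are only a suggestion):

int
uvn_put(struct uvm_object *uobj, struct vm_page **pps, int npages, int flags)
{
	struct uvm_vnode *uvn = (struct uvm_vnode *)uobj;
	int retval;

	KASSERT(rw_write_held(uobj->vmobjlock));

	/*
	 * ffs_write() assumes its caller holds a reference to the vnode.
	 * When we are called by the pagedaemon there may be no mapping
	 * left and therefore no reference, so take one with vget().  Use
	 * LK_NOWAIT: if the vnode is already being recycled, the thread
	 * cleaning it needs the vmobjlock we hold, so sleeping here could
	 * deadlock.  In that case return VM_PAGER_AGAIN and leave the
	 * page to be written out when the vnode is cleaned.
	 */
	KASSERT(uvn->u_flags & UVM_VNODE_VALID);
	if (vget(uvn->u_vnode, LK_NOWAIT))
		return VM_PAGER_AGAIN;

	retval = uvn_io((struct uvm_vnode*)uobj, pps, npages, flags, UIO_WRITE);

	vrele(uvn->u_vnode);

	return retval;
}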
