> On Nov. 23, 2015, 1:39 a.m., Sean Doran wrote:
> > Hi Sašo,
> > 
> >     I just hit an interesting corner case in the l2_rebuild_abort_lowmem 
> > logic.
> >  
> > - Import several pools; the rebuilds go fine.
> > - Import a pool that, at system shutdown, was dealing with a huge set (half 
> > a TiB) of deduped-dataset snapshot destroys.
> > - The import takes about 15-20 min waiting on the spa_namespace_lock mutex 
> > (in spa_import), the l2arc_feed_thr_lock mutex, and then the 
> > spa_namespace_lock mutex again (in spa_async_thread).
> > - During this time, other activity makes the ARC grow large.
> > - The rebuild sees arc_reclaim_needed(), and so the several GB of DDT 
> > (among other things) on the L2 device are effectively discarded.
> > 
> >    I think that the abort-in-lowmem response is a bit aggressive. In this 
> > particular example, this was the first time arc_reclaim_needed() had been 
> > true since the system started.
> > 
> >    Maybe backing off and trying again once or twice before aborting is a 
> > reasonable thing to do in the generic case? 
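> > 
> > For illustration, a retry-before-abort might look roughly like this 
> > userland sketch (the retry count, the stubs, and 
> > l2arc_rebuild_should_abort() are all invented names for the sake of the 
> > example, not the actual patch):
> > 
> > ```
> > #include <assert.h>
> > #include <stdbool.h>
> > 
> > /* Stub: pretend memory pressure clears after the first check. */
> > static int reclaim_calls;
> > static bool
> > arc_reclaim_needed(void)
> > {
> > 	return (reclaim_calls++ == 0);
> > }
> > 
> > /* Stub standing in for something like delay(hz) in the kernel. */
> > static void
> > backoff_delay(void)
> > {
> > }
> > 
> > #define	L2ARC_REBUILD_LOWMEM_RETRIES	2	/* invented tunable */
> > 
> > /* Return true if the rebuild should abort, false if it may continue. */
> > static bool
> > l2arc_rebuild_should_abort(void)
> > {
> > 	for (int i = 0; i <= L2ARC_REBUILD_LOWMEM_RETRIES; i++) {
> > 		if (!arc_reclaim_needed())
> > 			return (false);	/* pressure cleared; keep going */
> > 		backoff_delay();	/* back off, then re-check */
> > 	}
> > 	return (true);	/* still under pressure after all retries */
> > }
> > ```
> > 
> > With the stub above, pressure is seen only on the first check, so the 
> > rebuild survives; the first-import-after-heavy-shutdown case would then 
> > keep its DDT on L2.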
> > 
> >    Also, as lundman noted, so far in openzfs nobody else expects 
> > thread_join() to work in the kernel; my approach was to extend struct 
> > l2arc_dev to hold a mutex and a condvar,
> > 
> > ```
> >         kmutex_t                l2ad_rebuild_mutex;
> >         kcondvar_t              l2ad_rebuild_cv;
> > ```
> > 
> > and to do this in l2arc_remove_vdev():
> > 
> > ```
> > 	if (remdev->l2ad_rebuild_did != 0) {
> > 		/*
> > 		 * N.B. it should be safe to thread_join with the rebuild
> > 		 * thread while holding l2arc_dev_mtx because it is not
> > 		 * accessed from anywhere in the l2arc rebuild code below
> > 		 * (except for l2arc_spa_rebuild_start, which is ok).
> > 		 */
> > 		mutex_enter(&remdev->l2ad_rebuild_mutex);
> > 		do {
> > 			static int iter = 0;
> > 			dprintf("ZFS: %s: waiting on condvar for rebuild "
> > 			    "thread %p, iter %d\n", __func__,
> > 			    remdev->l2ad_rebuild_did, ++iter);
> > 			cv_timedwait(&remdev->l2ad_rebuild_cv,
> > 			    &remdev->l2ad_rebuild_mutex,
> > 			    ddi_get_lbolt() + (hz * 5));
> > 		} while (!remdev->l2ad_rebuild_thread_exiting);
> > 		mutex_exit(&remdev->l2ad_rebuild_mutex);
> > 		dprintf("ZFS: %s: thread done %p, destroying cv and mutex, "
> > 		    "setting did to 0\n", __func__, remdev->l2ad_rebuild_did);
> > 		cv_destroy(&remdev->l2ad_rebuild_cv);
> > 		mutex_destroy(&remdev->l2ad_rebuild_mutex);
> > 		remdev->l2ad_rebuild_did = 0;
> > 		/* thread_join(remdev->l2ad_rebuild_did); */
> > 	}
> > ```
> > 
> > and this in l2arc_spa_rebuild_start():
> > 
> > ```
> > 		if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
> > 			VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
> > #ifdef	_KERNEL
> > 			mutex_init(&dev->l2ad_rebuild_mutex, NULL,
> > 			    MUTEX_DEFAULT, NULL);
> > 			cv_init(&dev->l2ad_rebuild_cv, NULL, CV_DEFAULT,
> > 			    NULL);
> > 			mutex_enter(&dev->l2ad_rebuild_mutex);
> > 			dev->l2ad_rebuild_thread_exiting = FALSE;
> > 			mutex_exit(&dev->l2ad_rebuild_mutex);
> > 			dev->l2ad_rebuild_did = thread_create(NULL, 0,
> > 			    (void *)l2arc_dev_rebuild_start, dev, 0, &p0,
> > 			    TS_RUN, minclsyspri);
> > #endif
> > 		}
> > ```
> > 
> >  
> > and this in l2arc_dev_rebuild_start()
> > 
> > ```
> > 	mutex_enter(&dev->l2ad_rebuild_mutex);
> > 	dev->l2ad_rebuild_thread_exiting = FALSE;
> > 	if (!dev->l2ad_rebuild_cancel) {
> > 		mutex_exit(&dev->l2ad_rebuild_mutex);
> > 		VERIFY(dev->l2ad_rebuild);
> > 		(void) l2arc_rebuild(dev);
> > 		mutex_enter(&dev->l2ad_rebuild_mutex);
> > 		dev->l2ad_rebuild = B_FALSE;
> > 		mutex_exit(&dev->l2ad_rebuild_mutex);
> > 	} else {
> > 		mutex_exit(&dev->l2ad_rebuild_mutex);
> > 	}
> > 	mutex_enter(&dev->l2ad_rebuild_mutex);
> > 	dev->l2ad_rebuild_thread_exiting = TRUE;
> > 	mutex_exit(&dev->l2ad_rebuild_mutex);
> > 	cv_broadcast(&dev->l2ad_rebuild_cv);
> > 	kpreempt(KPREEMPT_SYNC);
> > 	if (!dev->l2ad_rebuild_cancel && dev->l2ad_rebuild_did == thr) {
> > 		/* we didn't get cleaned up in l2arc_remove_vdev */
> > 		cv_destroy(&dev->l2ad_rebuild_cv);
> > 		mutex_destroy(&dev->l2ad_rebuild_mutex);
> > 		dev->l2ad_rebuild_did = 0;
> > 	} else if (dev->l2ad_rebuild_did != thr) {
> > 		/*
> > 		 * zero: we probably got cleaned up in l2arc_remove_vdev;
> > 		 * nonzero: impossible?
> > 		 */
> > 		dprintf("ZFS: %s: did mismatch, thread %p/%p\n",
> > 		    __func__, thr, dev->l2ad_rebuild_did);
> > 	}
> > 	dprintf("ZFS: %s calling thread_exit(), thread %p/%p, "
> > 	    "l2ad_rebuild_cancel == %d\n", __func__, thr,
> > 	    dev->l2ad_rebuild_did, dev->l2ad_rebuild_cancel);
> > 	thread_exit();
> > ```
> > 
> > Finally, I agree with you in your discussion with Matt Ahrens that the ~51 
> > GB lower limit on rebuildable L2 vdevs is a bit strong.
> > 
> > Consider the case where a smaller L2ARC is actually desirable because the 
> > average block size is very small and turnover is fairly high; the extra 
> > l2hdrs can be significant pollution and will accumulate on a larger L2 
> > device. Even so, there is presumably still value in holding on to some 
> > smaller number of small blocks.
> > 
> >     Sean.

In a variety of tests of the across-system-shutdown behaviour of slow pools 
destroying huge deduped snapshots and datasets, modifying the low-memory test 

```
		/*
		 * Our memory pressure valve. If the system is running low
		 * on memory, rather than swamping memory with new ARC buf
		 * hdrs, we opt not to rebuild the L2ARC. At this point,
		 * however, we have already set up our L2ARC dev to chain in
		 * new metadata log blk, so the user may choose to re-add the
		 * L2ARC dev at a later time to reconstruct it (when there's
		 * less memory pressure).
		 */
		if (arc_reclaim_needed()) {
```

to

```
                if (arc_reclaim_needed() && arc_meta_used >= arc_meta_limit) {
```

seems to work reasonably for pools that are late to the party (the ARC has 
already warmed up) but have a bunch of IOPS-saving rebuildable data.

On the other hand, arc_reclaim_needed() is subtly different across the various 
openzfs implementations, so this might not be a good change for the generic 
case. Perhaps arc_meta_used should be compared against some fraction of 
arc_meta_limit instead.
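
For the fractional variant, a minimal userland sketch (zfs_l2arc_rebuild_meta_pct 
is an invented tunable, and the globals below are stubs standing in for the 
real ARC state, not the actual code):

```
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stubs for the ARC globals; the values are made up for illustration. */
static uint64_t arc_meta_used = 600;
static uint64_t arc_meta_limit = 1000;
static bool arc_reclaim_needed(void) { return (true); }

/* Invented tunable: abort only once metadata passes this % of its limit. */
static int zfs_l2arc_rebuild_meta_pct = 75;

static bool
l2arc_rebuild_should_abort(void)
{
	uint64_t thresh = arc_meta_limit * zfs_l2arc_rebuild_meta_pct / 100;

	return (arc_reclaim_needed() && arc_meta_used >= thresh);
}
```

With reclaim pressure but metadata at only 60% of its limit, the rebuild would 
proceed; past 75% it would abort, which keeps the pressure valve while 
tolerating an ARC that merely warmed up before the slow import finished.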


- Sean


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/267/#review877
-----------------------------------------------------------


On Nov. 5, 2015, 3:31 p.m., Saso Kiselkov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/267/
> -----------------------------------------------------------
> 
> (Updated Nov. 5, 2015, 3:31 p.m.)
> 
> 
> Review request for OpenZFS Developer Mailing List and Christopher Siden.
> 
> 
> Repository: illumos-gate
> 
> 
> Description
> -------
> 
> This is an upstream port of the Persistent L2ARC feature from Nexenta's repo.
> 
> 
> Diffs
> -----
> 
>   usr/src/uts/common/sys/fs/zfs.h bc9f057dd1361ae73a12375515abacd0fed820d2 
>   usr/src/uts/common/fs/zfs/vdev_label.c 
> f0924ab1e66eaa678540da8925995da6e0e2a29c 
>   usr/src/uts/common/fs/zfs/vdev.c 1c57fce4dcee909b164353181dcd8e2a29ed7946 
>   usr/src/uts/common/fs/zfs/sys/spa.h 
> 7ac78390338a44f7b7658017e1ae8fcc9beb89d6 
>   usr/src/uts/common/fs/zfs/sys/arc.h 
> 899b72114b9909098080a5d6bbad1a60808f030c 
>   usr/src/uts/common/fs/zfs/spa.c 95a6b0fae7760e8a1e8cfc1e657dc22fd9ef3245 
>   usr/src/uts/common/fs/zfs/arc.c 52f582be1633a8d462105e068ae9679c04753d24 
>   usr/src/lib/libzpool/common/sys/zfs_context.h 
> 9e4d8ed0b8ec42be75bb93f44602ac99e907cf00 
> 
> Diff: https://reviews.csiden.org/r/267/diff/
> 
> 
> Testing
> -------
> 
> Been running in Nexenta's QA process for the past 2+ months.
> 
> 
> Thanks,
> 
> Saso Kiselkov
> 
>

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer