> On Nov. 23, 2015, 1:39 a.m., Sean Doran wrote:
> > Hi Sašo,
> >
> > I just hit an interesting corner case in the l2_rebuild_abort_lowmem
> > logic.
> >
> > - Import several pools; rebuild goes fine.
> > - Import pool that at system shutdown was dealing with a huge set (half a
> > TiB) of deduped-dataset snapshot destroys.
> > - This takes about 15-20 min waiting on the spa_namespace_lock mutex (in
> > spa_import), the l2arc_feed_thr_lock mutex, and then the spa_namespace_lock
> > mutex again (in spa_async_thread).
> > - During this time, other activity makes the ARC grow large.
> > - The rebuild sees arc_reclaim_needed() return true, and so the several
> > GB of DDT (among other things) on the L2 device are effectively discarded.
> >
> > I think that the abort-in-lowmem response is a bit aggressive. In this
> > particular example, this is the first time arc_reclaim_needed() was true
> > since the system was started.
> >
> > Maybe backing off and trying again once or twice before aborting is a
> > reasonable thing to do in the generic case?
> >
> > Also, as lundman noted, nobody else in OpenZFS so far expects
> > thread_join() to work in the kernel; my approach was to extend struct
> > l2arc_dev to hold a mutex and a condvar,
> >
> > ```
> > kmutex_t l2ad_rebuild_mutex;
> > kcondvar_t l2ad_rebuild_cv;
> > ```
> >
> > and to do this in l2arc_remove_vdev():
> >
> > ```
> > if (remdev->l2ad_rebuild_did != 0) {
> > 	/*
> > 	 * N.B. it should be safe to thread_join with the rebuild
> > 	 * thread while holding l2arc_dev_mtx because it is not
> > 	 * accessed from anywhere in the l2arc rebuild code below
> > 	 * (except for l2arc_spa_rebuild_start, which is ok).
> > 	 */
> > 	mutex_enter(&remdev->l2ad_rebuild_mutex);
> > 	do {
> > 		static int iter = 0;
> > 		dprintf("ZFS: %s: waiting on condvar for rebuild "
> > 		    "thread %p, iter %d\n",
> > 		    __func__, remdev->l2ad_rebuild_did, ++iter);
> > 		cv_timedwait(&remdev->l2ad_rebuild_cv,
> > 		    &remdev->l2ad_rebuild_mutex,
> > 		    ddi_get_lbolt() + (hz * 5));
> > 	} while (!remdev->l2ad_rebuild_thread_exiting);
> > 	mutex_exit(&remdev->l2ad_rebuild_mutex);
> > 	dprintf("ZFS: %s: thread done %p, destroying cv and mutex, "
> > 	    "setting did to 0\n",
> > 	    __func__, remdev->l2ad_rebuild_did);
> > 	cv_destroy(&remdev->l2ad_rebuild_cv);
> > 	mutex_destroy(&remdev->l2ad_rebuild_mutex);
> > 	remdev->l2ad_rebuild_did = 0;
> > 	/* thread_join(remdev->l2ad_rebuild_did); */
> > }
> > ```
> >
> > and this in l2arc_spa_rebuild_start():
> >
> > ```
> > if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
> > 	VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
> > #ifdef _KERNEL
> > 	mutex_init(&dev->l2ad_rebuild_mutex, NULL,
> > 	    MUTEX_DEFAULT, NULL);
> > 	cv_init(&dev->l2ad_rebuild_cv, NULL, CV_DEFAULT, NULL);
> > 	mutex_enter(&dev->l2ad_rebuild_mutex);
> > 	dev->l2ad_rebuild_thread_exiting = FALSE;
> > 	mutex_exit(&dev->l2ad_rebuild_mutex);
> > 	dev->l2ad_rebuild_did = thread_create(NULL, 0,
> > 	    (void *)l2arc_dev_rebuild_start, dev, 0, &p0,
> > 	    TS_RUN, minclsyspri);
> > #endif
> > }
> > ```
> >
> >
> > and this in l2arc_dev_rebuild_start():
> >
> > ```
> > mutex_enter(&dev->l2ad_rebuild_mutex);
> > dev->l2ad_rebuild_thread_exiting = FALSE;
> > if (!dev->l2ad_rebuild_cancel) {
> > 	mutex_exit(&dev->l2ad_rebuild_mutex);
> > 	VERIFY(dev->l2ad_rebuild);
> > 	(void) l2arc_rebuild(dev);
> > 	mutex_enter(&dev->l2ad_rebuild_mutex);
> > 	dev->l2ad_rebuild = B_FALSE;
> > 	mutex_exit(&dev->l2ad_rebuild_mutex);
> > } else {
> > 	mutex_exit(&dev->l2ad_rebuild_mutex);
> > }
> > mutex_enter(&dev->l2ad_rebuild_mutex);
> > dev->l2ad_rebuild_thread_exiting = TRUE;
> > mutex_exit(&dev->l2ad_rebuild_mutex);
> > cv_broadcast(&dev->l2ad_rebuild_cv);
> > kpreempt(KPREEMPT_SYNC);
> > if (!dev->l2ad_rebuild_cancel && dev->l2ad_rebuild_did == thr) {
> > 	/* we didn't get cleaned up in l2arc_remove_vdev */
> > 	cv_destroy(&dev->l2ad_rebuild_cv);
> > 	mutex_destroy(&dev->l2ad_rebuild_mutex);
> > 	dev->l2ad_rebuild_did = 0;
> > } else if (dev->l2ad_rebuild_did != thr) {
> > 	/*
> > 	 * zero: we probably got cleaned up in l2arc_remove_vdev;
> > 	 * nonzero: impossible?
> > 	 */
> > 	dprintf("ZFS: %s: thread %p, l2ad_rebuild_did %p\n",
> > 	    __func__, thr, dev->l2ad_rebuild_did);
> > }
> > dprintf("ZFS: %s calling thread_exit(), thread %p/%p, "
> >     "l2ad_rebuild_cancel == %d\n",
> >     __func__, thr, dev->l2ad_rebuild_did,
> >     dev->l2ad_rebuild_cancel);
> > thread_exit();
> > ```
> >
> > Finally, I agree with you in your discussion with Matt Ahrens that the ~51
> > GB lower limit on rebuildable L2 vdevs is a bit strong.
> >
> > Consider the case where a smaller L2ARC is actually desirable because the
> > average block size is very small and turnover is fairly high; the extra
> > l2hdrs can be significant pollution, and will accumulate on a larger L2
> > device. However, there is still presumably value in holding on to some
> > smaller number of small blocks.
> >
> > Sean.
In a variety of tests with slow pools that were destroying huge deduped
snapshots and datasets across a system shutdown, modifying the low-memory test
```
/*
* Our memory pressure valve. If the system is running low
* on memory, rather than swamping memory with new ARC buf
* hdrs, we opt not to rebuild the L2ARC. At this point,
* however, we have already set up our L2ARC dev to chain in
* new metadata log blk, so the user may choose to re-add the
* L2ARC dev at a later time to reconstruct it (when there's
* less memory pressure).
*/
if (arc_reclaim_needed()) {
```
to
```
if (arc_reclaim_needed() && arc_meta_used >= arc_meta_limit) {
```
seems to work reasonably for pools that are late to the party (the ARC has
already warmed up) but carry a lot of IOPS-saving rebuildable data.
On the other hand, arc_reclaim_needed() is subtly different across the various
OpenZFS implementations, so this might not be a good change for the generic
case. Perhaps arc_meta_used should be compared against some fraction of
arc_meta_limit instead.
- Sean
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/267/#review877
-----------------------------------------------------------
On Nov. 5, 2015, 3:31 p.m., Saso Kiselkov wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/267/
> -----------------------------------------------------------
>
> (Updated Nov. 5, 2015, 3:31 p.m.)
>
>
> Review request for OpenZFS Developer Mailing List and Christopher Siden.
>
>
> Repository: illumos-gate
>
>
> Description
> -------
>
> This is an upstream port of the Persistent L2ARC feature from Nexenta's repo.
>
>
> Diffs
> -----
>
> usr/src/uts/common/sys/fs/zfs.h bc9f057dd1361ae73a12375515abacd0fed820d2
> usr/src/uts/common/fs/zfs/vdev_label.c
> f0924ab1e66eaa678540da8925995da6e0e2a29c
> usr/src/uts/common/fs/zfs/vdev.c 1c57fce4dcee909b164353181dcd8e2a29ed7946
> usr/src/uts/common/fs/zfs/sys/spa.h
> 7ac78390338a44f7b7658017e1ae8fcc9beb89d6
> usr/src/uts/common/fs/zfs/sys/arc.h
> 899b72114b9909098080a5d6bbad1a60808f030c
> usr/src/uts/common/fs/zfs/spa.c 95a6b0fae7760e8a1e8cfc1e657dc22fd9ef3245
> usr/src/uts/common/fs/zfs/arc.c 52f582be1633a8d462105e068ae9679c04753d24
> usr/src/lib/libzpool/common/sys/zfs_context.h
> 9e4d8ed0b8ec42be75bb93f44602ac99e907cf00
>
> Diff: https://reviews.csiden.org/r/267/diff/
>
>
> Testing
> -------
>
> Been running in Nexenta's QA process for the past 2+ months.
>
>
> Thanks,
>
> Saso Kiselkov
>
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer