-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/267/#review877
-----------------------------------------------------------
Hi Sašo,
I just hit an interesting corner case in the l2_rebuild_abort_lowmem
logic:
- Import several pools; the rebuilds go fine.
- Import a pool that, at system shutdown, was in the middle of a huge set
(half a TiB) of deduped-dataset snapshot destroys.
- This import takes about 15-20 minutes waiting on the spa_namespace_lock
mutex (in spa_import), then the l2arc_feed_thr_lock mutex, and then the
spa_namespace_lock mutex again (in spa_async_thread).
- During this time, other activity makes the ARC grow large.
- The rebuild sees arc_reclaim_needed() return true, and so the several GB
of DDT (among other things) on the L2 device are effectively discarded.
I think the abort-in-lowmem response is a bit aggressive. In this
particular example, this was the first time arc_reclaim_needed() had been
true since the system was started.
Maybe backing off and trying again once or twice before aborting is a
reasonable thing to do in the generic case?
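For discussion's sake, here is a minimal sketch of that backoff, assuming a
hypothetical l2arc_rebuild_lowmem_retries tunable and a helper called from the
rebuild loop (neither is in the current diff):
```
/*
 * Sketch only: instead of aborting the rebuild on the first
 * arc_reclaim_needed(), back off and re-check a few times.
 * l2arc_rebuild_lowmem_retries is a hypothetical tunable.
 */
int l2arc_rebuild_lowmem_retries = 3;

static boolean_t
l2arc_rebuild_should_abort_lowmem(void)
{
	int tries;

	for (tries = 0; tries < l2arc_rebuild_lowmem_retries; tries++) {
		if (!arc_reclaim_needed())
			return (B_FALSE);	/* pressure subsided; keep going */
		delay(hz);	/* give the reclaim thread time to catch up */
	}
	return (B_TRUE);	/* persistent pressure; give up on the rebuild */
}
```
That way a single transient spike in memory pressure would not throw away
several GB of rebuilt L2 metadata.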
Also, as lundman noted, so far nobody else in OpenZFS expects thread_join()
to work in the kernel; my approach was to extend struct l2arc_dev to hold a
mutex and a condvar:
```
kmutex_t	l2ad_rebuild_mutex;	/* protects the rebuild-thread state */
kcondvar_t	l2ad_rebuild_cv;	/* broadcast when the rebuild thread exits */
```
and to do this in l2arc_remove_vdev():
```
if (remdev->l2ad_rebuild_did != 0) {
	/*
	 * N.B. it should be safe to wait out the rebuild thread
	 * while holding l2arc_dev_mtx because the lock is not
	 * accessed from anywhere in the l2arc rebuild code below
	 * (except for l2arc_spa_rebuild_start, which is ok).
	 */
	mutex_enter(&remdev->l2ad_rebuild_mutex);
	do {
		static int iter = 0;
		dprintf("ZFS: %s: waiting on condvar for rebuild "
		    "thread %p, iter %d\n", __func__,
		    remdev->l2ad_rebuild_did, ++iter);
		/* wake up every 5 seconds to re-check the exit flag */
		(void) cv_timedwait(&remdev->l2ad_rebuild_cv,
		    &remdev->l2ad_rebuild_mutex,
		    ddi_get_lbolt() + (hz * 5));
	} while (!remdev->l2ad_rebuild_thread_exiting);
	mutex_exit(&remdev->l2ad_rebuild_mutex);
	dprintf("ZFS: %s: thread done %p, destroying cv and mutex, "
	    "setting did to 0\n", __func__, remdev->l2ad_rebuild_did);
	cv_destroy(&remdev->l2ad_rebuild_cv);
	mutex_destroy(&remdev->l2ad_rebuild_mutex);
	remdev->l2ad_rebuild_did = 0;
	/* thread_join(remdev->l2ad_rebuild_did); -- replaced by the wait above */
}
```
and this in l2arc_spa_rebuild_start():
```
if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
	VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
#ifdef _KERNEL
	mutex_init(&dev->l2ad_rebuild_mutex, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&dev->l2ad_rebuild_cv, NULL, CV_DEFAULT, NULL);
	mutex_enter(&dev->l2ad_rebuild_mutex);
	dev->l2ad_rebuild_thread_exiting = FALSE;
	mutex_exit(&dev->l2ad_rebuild_mutex);
	dev->l2ad_rebuild_did = thread_create(NULL, 0,
	    (void (*)())l2arc_dev_rebuild_start, dev, 0, &p0,
	    TS_RUN, minclsyspri);
#endif
}
```
and this in l2arc_dev_rebuild_start():
```
/* "thr" is assumed to be set at thread entry, e.g. kthread_t *thr = curthread; */
mutex_enter(&dev->l2ad_rebuild_mutex);
dev->l2ad_rebuild_thread_exiting = FALSE;
if (!dev->l2ad_rebuild_cancel) {
	mutex_exit(&dev->l2ad_rebuild_mutex);
	VERIFY(dev->l2ad_rebuild);
	(void) l2arc_rebuild(dev);
	mutex_enter(&dev->l2ad_rebuild_mutex);
	dev->l2ad_rebuild = B_FALSE;
	mutex_exit(&dev->l2ad_rebuild_mutex);
} else {
	mutex_exit(&dev->l2ad_rebuild_mutex);
}
/* let any waiter in l2arc_remove_vdev() know we are done */
mutex_enter(&dev->l2ad_rebuild_mutex);
dev->l2ad_rebuild_thread_exiting = TRUE;
mutex_exit(&dev->l2ad_rebuild_mutex);
cv_broadcast(&dev->l2ad_rebuild_cv);
kpreempt(KPREEMPT_SYNC);
if (!dev->l2ad_rebuild_cancel && dev->l2ad_rebuild_did == thr) {
	/* we didn't get cleaned up in l2arc_remove_vdev */
	cv_destroy(&dev->l2ad_rebuild_cv);
	mutex_destroy(&dev->l2ad_rebuild_mutex);
	dev->l2ad_rebuild_did = 0;
} else if (dev->l2ad_rebuild_did != thr) {
	/* zero: we probably got cleaned up in l2arc_remove_vdev; nonzero: impossible? */
	dprintf("ZFS: %s: unexpected rebuild did %p != %p\n",
	    __func__, thr, dev->l2ad_rebuild_did);
}
dprintf("ZFS: %s calling thread_exit(), thread %p/%p, "
    "l2ad_rebuild_cancel == %d\n", __func__, thr,
    dev->l2ad_rebuild_did, dev->l2ad_rebuild_cancel);
thread_exit();
```
Finally, I agree with you, from your discussion with Matt Ahrens, that the
~51 GB lower limit on rebuildable L2 vdevs is a bit strong.
Consider the case where a smaller L2ARC is actually desirable because the
average block size is very small and turnover is fairly high; the extra l2hdrs
can be significant pollution and will accumulate on a larger L2 device.
However, there is presumably still value in holding on to some smaller number
of small blocks.
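As a straw man, the floor could be a tunable rather than a hard-coded
constant; the names below (l2arc_rebuild_min_size, l2arc_dev_rebuildable) are
illustrative, not from the diff:
```
/*
 * Sketch only: expose the minimum rebuildable L2 device size as a
 * tunable so operators with small-block, high-turnover workloads
 * can lower it.  The default mirrors the current ~51 GB floor.
 */
uint64_t l2arc_rebuild_min_size = 51ULL << 30;

static boolean_t
l2arc_dev_rebuildable(const l2arc_dev_t *dev)
{
	/* l2ad_start/l2ad_end bound the usable region of the device */
	return (dev->l2ad_end - dev->l2ad_start >= l2arc_rebuild_min_size);
}
```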
Sean.
- Sean Doran
On Nov. 5, 2015, 3:31 p.m., Saso Kiselkov wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/267/
> -----------------------------------------------------------
>
> (Updated Nov. 5, 2015, 3:31 p.m.)
>
>
> Review request for OpenZFS Developer Mailing List and Christopher Siden.
>
>
> Repository: illumos-gate
>
>
> Description
> -------
>
> This is an upstream port of the Persistent L2ARC feature from Nexenta's repo.
>
>
> Diffs
> -----
>
> usr/src/uts/common/sys/fs/zfs.h bc9f057dd1361ae73a12375515abacd0fed820d2
> usr/src/uts/common/fs/zfs/vdev_label.c f0924ab1e66eaa678540da8925995da6e0e2a29c
> usr/src/uts/common/fs/zfs/vdev.c 1c57fce4dcee909b164353181dcd8e2a29ed7946
> usr/src/uts/common/fs/zfs/sys/spa.h 7ac78390338a44f7b7658017e1ae8fcc9beb89d6
> usr/src/uts/common/fs/zfs/sys/arc.h 899b72114b9909098080a5d6bbad1a60808f030c
> usr/src/uts/common/fs/zfs/spa.c 95a6b0fae7760e8a1e8cfc1e657dc22fd9ef3245
> usr/src/uts/common/fs/zfs/arc.c 52f582be1633a8d462105e068ae9679c04753d24
> usr/src/lib/libzpool/common/sys/zfs_context.h 9e4d8ed0b8ec42be75bb93f44602ac99e907cf00
>
> Diff: https://reviews.csiden.org/r/267/diff/
>
>
> Testing
> -------
>
> Been running in Nexenta's QA process for the past 2+ months.
>
>
> Thanks,
>
> Saso Kiselkov
>
>