On 7/11/16 6:37 PM, Matthew Ahrens wrote:
> Very cool. We should definitely collaborate.
Agree!

> We avoid having to do multiple passes over the data by grouping pointers
> by proximity and sorting in a more complicated way, rather than just
> doing a naive elevator each pass. This results in performance that,
> depending on how badly randomized the layout is, yields upwards of 10x
> performance improvement (and in many cases approaches sequential
> throughput). We've not implemented the storing of the partial-sorted
> list on disk, as that would be a far more complex undertaking than just
> keeping it in memory. That having been said, our memory overhead is
> essentially zero compared to on-disk metadata, i.e. we are able to sort
> and manipulate all sortable BPs in memory in essentially the same amount
> of space that the metadata takes up on disk. IOW, as long as your RAM
> quantity is not too far off being 1% of your disk storage's capacity, we
> can sort extremely well and achieve near sequential throughput.
>
> So you need to keep all the BPs in the pool in memory at once? What
> happens if they don't fit? Falls back on the current non-sorted scrub?

So the way it works is that we have essentially split scrub into two
cooperating threads, the "scanner" and the "reader". The scanner pushes
BPs into a per-device sorting queue, which consists of two AVL trees and
a range tree. In the range tree, BPs are joined up into extents, allowing
for a certain inter-BP gap (i.e. if the offsets are close enough, we
consider it worthwhile to read them in one go). We also track the "fill"
of an extent, i.e. how much of the extent is actual readable data and how
much is inter-BP gaps. While scanning is going on, we continuously sort
these extents so that the "juiciest" (largest and most contiguous) ones
are always at the front. If we hit a configurable memory cap, the scanner
pauses for a bit and the reader is engaged to start consuming the queue,
issuing all the reads.
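To make the queue idea above concrete, here is a minimal toy sketch (not the actual implementation; all names such as ext_t and EXT_GAP_MAX are made up for illustration): BPs are merged into gap-tolerant extents that track their fill, and a comparator ranks the "juiciest" extents first. The real queue uses AVL/range trees and can merge out-of-order arrivals; this sketch assumes BPs arrive in ascending offset order.

```c
#include <stdint.h>
#include <stdlib.h>

#define EXT_GAP_MAX 4096  /* join BPs up to 4 KiB apart (illustrative value) */

typedef struct {
	uint64_t start;  /* device offset of extent start */
	uint64_t end;    /* one past the last byte covered */
	uint64_t fill;   /* bytes of actual BP data inside the extent */
} ext_t;

/*
 * Add one BP (off, size) to the queue q holding n extents, merging it
 * into the last extent when the gap to it is within EXT_GAP_MAX.
 * Returns the new extent count; the caller sizes q appropriately.
 */
static int
ext_add(ext_t *q, int n, uint64_t off, uint64_t size)
{
	if (n > 0 && off >= q[n - 1].end && off - q[n - 1].end <= EXT_GAP_MAX) {
		q[n - 1].end = off + size;
		q[n - 1].fill += size;
		return (n);
	}
	q[n] = (ext_t){ off, off + size, size };
	return (n + 1);
}

/*
 * qsort comparator ranking extents "juiciest" first: more readable
 * data wins; on a tie, the extent with fewer gap bytes to skip wins.
 */
static int
ext_cmp(const void *a, const void *b)
{
	const ext_t *x = a, *y = b;
	if (x->fill != y->fill)
		return ((x->fill < y->fill) ? 1 : -1);
	uint64_t xgap = (x->end - x->start) - x->fill;
	uint64_t ygap = (y->end - y->start) - y->fill;
	return ((xgap > ygap) - (xgap < ygap));
}
```

For example, an 8 KiB BP at offset 0 followed by another at offset 10240 leaves a 2 KiB gap, within the threshold, so the two merge into a single 18 KiB extent with 16 KiB of fill that can be read in one sequential I/O.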
Previously we did this in parallel, but I found that doing only one at a
time helps overall throughput (metadata reads are random, whereas sorted
extent reading is sequential, so it pays not to interrupt it).

> We didn't think that was OK, which is why we allowed multiple passes
> over the metadata. If you're using recordsize=8k (or volblocksize=8k),
> the metadata is 16GB per 1TB of disk space. Though I imagine
> compression or fancy encoding of the BP could improve that somewhat.

For simplicity's sake, I've implemented the sorting such that we only do
it for level=0 blocks and only for blocks with copies=1. Removing either
of these restrictions seemed like a rather complex addition to the
algorithm for relatively little payoff. That is not to say it wouldn't
work, but I was trying to keep this a relatively simple modification.

> I don't know whether to continue this project at this point. I'd like
> to avoid having two competing implementations of essentially the same
> thing.
>
> Let me see if I can find our existing code and share it with you all.

I'll try to throw our code up somewhere as well.

Cheers,
--
Saso
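P.S. A quick sanity check of the "16GB per 1TB" figure quoted above: at recordsize=8k, 1 TiB of data is 2^27 level-0 blocks, and each block pointer (blkptr_t) occupies 128 bytes on disk. A one-liner (the helper name is made up) makes the arithmetic explicit:

```c
#include <stdint.h>

/*
 * Bytes of L0 block pointers needed to describe data_bytes of data
 * stored in blocks of block_size bytes; a blkptr_t is 128 bytes.
 */
static uint64_t
bp_metadata_bytes(uint64_t data_bytes, uint64_t block_size)
{
	const uint64_t BP_SIZE = 128;	/* sizeof (blkptr_t) */
	return ((data_bytes / block_size) * BP_SIZE);
}
```

bp_metadata_bytes(1ULL << 40, 8192) comes out to 16 GiB, about 1.6% of the data it describes, which also lines up with the "RAM around 1% of capacity" rule of thumb earlier in the thread.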
