> We also
> track the "fill" of an extent, i.e. how much of the extent is comprised
> of actual readable data and how much is inter-BP gaps. While
> scanning is going on, we continuously sort these extents such that we
> always keep the "juiciest" (largest and most contiguous) ones at the front.

I'm wondering what is throughput difference of reading such punched-out extents 
vs. gapless ones.

From this, assuming uniformly random BP offsets encountered by the 'scanner', 
it looks like the 'juiciness' of an extent is limited mostly by the total 
amount of RAM you are devoting to intermediate metadata (sorry for the 
terminology). This could be modeled fairly easily for the worst case.

Whereas, if the metadata is tracked in a single sorted list, scrub I/O would 
be more spread out in the beginning (still linear), but would have a 
decreasing number of holes as the offset increases.

Secondly, I think the algorithm should be adaptable to different scenarios as 
much as possible. For example, we run HPC (Lustre) on top of ZFS, with >300TB 
in several pools per server, so we can't assume a strong correlation between 
vdev size and the amount of memory needed for scrub.

Third, what about bookmarking for resuming the operation in case of a restart 
or other interruption? Sorted lists would have to be either saved on disk or 
periodically drained. Or there's another way...

Anyhow, if you want to collaborate on this, I'm available.

Regards,
-- 
Gvozden Neskovic
[email protected]


> On 11 Jul 2016, at 18:54, Saso Kiselkov <[email protected]> wrote:
> 
> Agree!
> 
>>    We avoid having to do multiple passes over the data by grouping pointers
>>    by proximity and sorting in a more complicated way, rather than just
>>    doing a naive elevator each pass. This results in performance that,
>>    depending on how badly randomized the layout is, yields upwards of 10x
>>    performance improvement (and in many cases approaches sequential
>>    throughput). We've not implemented the storing of the partially sorted
>>    list on disk, as that would be a far more complex undertaking than just
>>    keeping it in memory. That having been said, our memory overhead is
>>    essentially zero compared to on-disk metadata, i.e. we are able to sort
>>    and manipulate all sortable BPs in memory in essentially the same amount
>>    of space that the metadata takes up on disk. IOW, as long as your RAM
>>    quantity is not too far off being 1% of your disk storage's capacity, we
>>    can sort extremely well and achieve near sequential throughput.
>> 
>> 
>> So you need to keep all the BP's in the pool in memory at once?  What
>> happens if they don't fit?  Falls back on the current non-sorted scrub?
> 
> So the way it works is that we have essentially split up scrub into two
> cooperating threads, the "scanner" and the "reader". The scanner pushes
> BPs into a per-device sorting queue, which consists of a set of two AVL
> trees and a range tree. In the range tree, BPs are joined up into
> extents and we allow for a certain inter-BP gap (i.e. if the offsets are
> close enough, we consider it worthwhile to read them at once). We also
> track the "fill" of an extent, i.e. how much of the extent is comprised
> of actual readable data and how much is inter-BP gaps. While
> scanning is going on, we continuously sort these extents such that we
> always keep the "juiciest" (largest and most contiguous) ones at the front.
> 
> If we hit a configurable memory cap, we have the scanner pause for a bit
> and engage the reader so that it starts consuming the queue, issuing all
> the reads. Previously we did this in parallel, but I found that doing
> only one at a time helps overall throughput (because metadata reads are
> random, whereas sorted extent reading is sequential, so it helps not
> interrupting it).
> 
>> We didn't think that was OK, which is why we allowed multiple passes
>> over the metadata. If you're using recordsize=8k (or volblocksize=8k),
>> the metadata is 16GB per 1TB of disk space.  Though I imagine
>> compression or fancy encoding of the BP could improve that somewhat.
> 
> For simplicity's sake, I've implemented the sorting such that we only do
> it for level=0 blocks and only for blocks with copies=1. Removing either
> of these seemed like a rather complex addition to the algorithm for
> relatively little payoff. That is not to say that it wouldn't work, but
> basically I was trying to keep this to a relatively simple modification.
> 
>>    I don't know whether to continue this project at this point. I'd like to
>>    avoid having two competing implementations of essentially the same thing.
>> 
>> Let me see if I can find our existing code and share it with you all.
> 
> I'll try to throw our code up somewhere as well.




-------------------------------------------
openzfs-developer
Archives: https://www.listbox.com/member/archive/274414/=now