On 7/11/16 6:37 PM, Matthew Ahrens wrote:
> Very cool. We should definitely collaborate.
Agree!

> We avoid having to do multiple passes over the data by grouping pointers
> by proximity and sorting in a more complicated way, rather than just
> doing a naive elevator each pass. This results in performance that,
> depending on how badly randomized the layout is, yields upwards of 10x
> performance improvement (and in many cases approaches sequential
> throughput). We've not implemented the storing of the partial-sorted
> list on disk, as that would be a far more complex undertaking than just
> keeping it in memory. That having been said, our memory overhead is
> essentially zero compared to on-disk metadata, i.e. we are able to sort
> and manipulate all sortable BPs in memory in essentially the same amount
> of space that the metadata takes up on disk. IOW, as long as your RAM
> quantity is not too far off being 1% of your disk storage's capacity, we
> can sort extremely well and achieve near sequential throughput.
>
> So you need to keep all the BPs in the pool in memory at once? What
> happens if they don't fit? Falls back on the current non-sorted scrub?

So the way it works is that we have essentially split scrub into two
cooperating threads, the "scanner" and the "reader". The scanner pushes
BPs into a per-device sorting queue, which consists of two AVL trees and
a range tree. In the range tree, BPs are joined up into extents, allowing
for a certain inter-BP gap (i.e. if the offsets are close enough, we
consider it worthwhile to read them in one go). We also track the "fill"
of an extent, i.e. how much of the extent is actual readable data and how
much is inter-BP gaps. While scanning is going on, we continuously sort
these extents so that the "juiciest" (largest and most contiguous) ones
are always at the front. If we hit a configurable memory cap, the scanner
pauses for a bit and the reader is engaged to start consuming the queue,
issuing all the reads.
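To make the queue idea above concrete, here is a minimal toy sketch (not the actual implementation; all names such as ext_t and EXT_GAP_MAX are made up for illustration): BPs are merged into gap-tolerant extents that track their fill, and a comparator ranks the "juiciest" extents first. The real queue uses AVL/range trees and can merge out-of-order arrivals; this sketch assumes BPs arrive in ascending offset order.

```c
#include <stdint.h>
#include <stdlib.h>

#define EXT_GAP_MAX 4096  /* join BPs up to 4 KiB apart (illustrative value) */

typedef struct {
	uint64_t start;  /* device offset of extent start */
	uint64_t end;    /* one past the last byte covered */
	uint64_t fill;   /* bytes of actual BP data inside the extent */
} ext_t;

/*
 * Add one BP (off, size) to the queue q holding n extents, merging it
 * into the last extent when the gap to it is within EXT_GAP_MAX.
 * Returns the new extent count; the caller sizes q appropriately.
 */
static int
ext_add(ext_t *q, int n, uint64_t off, uint64_t size)
{
	if (n > 0 && off >= q[n - 1].end && off - q[n - 1].end <= EXT_GAP_MAX) {
		q[n - 1].end = off + size;
		q[n - 1].fill += size;
		return (n);
	}
	q[n] = (ext_t){ off, off + size, size };
	return (n + 1);
}

/*
 * qsort comparator ranking extents "juiciest" first: more readable
 * data wins; on a tie, the extent with fewer gap bytes to skip wins.
 */
static int
ext_cmp(const void *a, const void *b)
{
	const ext_t *x = a, *y = b;
	if (x->fill != y->fill)
		return ((x->fill < y->fill) ? 1 : -1);
	uint64_t xgap = (x->end - x->start) - x->fill;
	uint64_t ygap = (y->end - y->start) - y->fill;
	return ((xgap > ygap) - (xgap < ygap));
}
```

For example, an 8 KiB BP at offset 0 followed by another at offset 10240 leaves a 2 KiB gap, within the threshold, so the two merge into a single 18 KiB extent with 16 KiB of fill that can be read in one sequential I/O.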
Previously we did this in parallel, but I found that doing only one at a
time helps overall throughput (metadata reads are random, whereas sorted
extent reading is sequential, so it pays not to interrupt it).

> We didn't think that was OK, which is why we allowed multiple passes
> over the metadata. If you're using recordsize=8k (or volblocksize=8k),
> the metadata is 16GB per 1TB of disk space. Though I imagine
> compression or fancy encoding of the BP could improve that somewhat.

For simplicity's sake, I've implemented the sorting such that we only do
it for level=0 blocks and only for blocks with copies=1. Removing either
of these restrictions seemed like a rather complex addition to the
algorithm for relatively little payoff. That is not to say it wouldn't
work, but I was trying to keep this a relatively simple modification.

> I don't know whether to continue this project at this point. I'd like
> to avoid having two competing implementations of essentially the same
> thing.
>
> Let me see if I can find our existing code and share it with you all.

I'll try to throw our code up somewhere as well.

Cheers,
--
Saso
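P.S. A quick sanity check of the "16GB per 1TB" figure quoted above: at recordsize=8k, 1 TiB of data is 2^27 level-0 blocks, and each block pointer (blkptr_t) occupies 128 bytes on disk. A one-liner (the helper name is made up) makes the arithmetic explicit:

```c
#include <stdint.h>

/*
 * Bytes of L0 block pointers needed to describe data_bytes of data
 * stored in blocks of block_size bytes; a blkptr_t is 128 bytes.
 */
static uint64_t
bp_metadata_bytes(uint64_t data_bytes, uint64_t block_size)
{
	const uint64_t BP_SIZE = 128;	/* sizeof (blkptr_t) */
	return ((data_bytes / block_size) * BP_SIZE);
}
```

bp_metadata_bytes(1ULL << 40, 8192) comes out to 16 GiB, about 1.6% of the data it describes, which also lines up with the "RAM around 1% of capacity" rule of thumb earlier in the thread.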
