Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

Richard Elling Sun, 28 Oct 2012 22:10:10 -0700

On Oct 28, 2012, at 5:10 AM, Robin Axelsson <[email protected]> 
wrote:
> On 2012-10-24 21:58, Timothy Coalson wrote:
>> On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson<
>> [email protected]>  wrote:
>>> It would be interesting to know how you convert a raidz2 stripe to say a
>>> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
>>> parity drive by converting it to a raidz3 pool.  I'm imagining that would
>>> be like creating a raidz1 pool on top of the leaf vdevs that constitute
>>> the raidz2 pool and the new leaf vdev which results in an additional parity
>>> drive. It doesn't sound too difficult to do that. Actually, this way you
>>> could even get raidz4 or raidz5 pools. Question is though, how things would
>>> pan out performance wise, I would imagine that a 55 drive raidz25 pool is
>>> really taxing on the CPU.
>>> 
>> Multiple parity is more complicated than that: an additional xor device (a
>> la traditional raid4) would end up with zeros everywhere, and couldn't
>> reconstruct your data from an additional failure.  Look at "computing
>> parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 .  While in theory it
>> can extend to more than 3 parity blocks, it is unclear whether more than 3
>> will offer any serious additional benefits (using multiple raidz2 vdevs can
>> give you better IOPS than larger raidz3 vdevs, with little change in raw
>> space efficiency).  There are also combinatorial implications to multiple
>> bit errors in a single data chunk with high parity levels, but that is
>> somewhat unlikely.
> 
> XOR you say? I didn't know that raidz used xor for parity. I thought they 
> used some kind of a Reed-Solomon implementation à la PAR2 on the block level 
> to achieve "RAID like" functionality. It never was stated from what I could 
> read in the documentation that the raid functionality was implemented like 
> traditional hardware RAID. If xor is the case then I'm curious as to how they 
> managed to pull off a raidz3 implementation with three disk redundancy.


The first parity is XOR (also a Reed-Solomon syndrome). The 2nd and 3rd 
parity are other syndromes.
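
To make that concrete, here is a rough Python sketch of single-byte P and Q
parity along the lines of the Wikipedia article cited above. It assumes
GF(2^8) with polynomial 0x11d and generator 2, and every function name in it
is made up for illustration; it is not lifted from the raidz source. It
shows why a second plain-XOR device would only ever hold zeros, and how a
second syndrome lets you solve for two missing blocks.

```python
from functools import reduce

# Assumption: GF(2^8) with polynomial x^8+x^4+x^3+x^2+1 (0x11d), generator 2.
# All names below are hypothetical, chosen for this illustration only.

def gf_mul2(x):
    """Multiply a field element by the generator 2."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x & 0xFF

def gf_mul(a, b):
    """Multiply two field elements (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = gf_mul2(a)
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a field element to a non-negative integer power."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    """Multiplicative inverse: a^254 == a^-1 for nonzero a."""
    return gf_pow(a, 254)

def parity_p(data):
    """First parity: plain XOR of all data bytes."""
    return reduce(lambda x, y: x ^ y, data, 0)

def parity_q(data):
    """Second parity: XOR of 2^i * D_i over the field."""
    q = 0
    for i, d in enumerate(data):
        q ^= gf_mul(gf_pow(2, i), d)
    return q

def recover_two(survivors, x, y, p, q):
    """Recover the bytes at positions x and y.

    survivors is the data list with 0 in the two lost positions.
    """
    pxy = p ^ parity_p(survivors)      # Dx ^ Dy
    qxy = q ^ parity_q(survivors)      # 2^x*Dx ^ 2^y*Dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    dx = gf_mul(qxy ^ gf_mul(gy, pxy), gf_inv(gx ^ gy))
    return dx, pxy ^ dx

data = [0x12, 0x34, 0x56, 0x78]
p, q = parity_p(data), parity_q(data)
# XOR over data plus P is always zero, so a second XOR device adds nothing.
assert parity_p(data + [p]) == 0
```

Calling recover_two([0x12, 0, 0x56, 0], 1, 3, p, q) solves the two equations
for the lost bytes; a third parity would add one more independent equation
in the same way.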

Also, minor nit: there is no such thing as hardware RAID, there is only 
software RAID.
 -- richard

> 
> Maybe a good read of the zpool source code would help clarify things...
> 
>> 
>>> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
>>> no-brainer; you just remove one drive from the pool and force zpool to
>>> accept the new state as "normal".
>>> 
>> A degraded raidz2 vdev has to compute the missing block from parity on
>> nearly every read; this is not the normal state of raidz1.  Changing the
>> parity level, either up or down, has similar complications in the on-disk
>> structure.
>> 
>>> But expanding a raidz pool with additional storage while preserving the
>>> parity structure sounds a little bit trickier. I don't think I have the
>>> knowledge to write a bp rewriter although I'm reading Solaris Internals
>>> right now ;)
>> 
>> Unless raidz* did something radically different than raid5/6 (as in, not
>> having the parity blocks necessarily next to each other in the data chunk,
>> and having their positions recorded in the data chunk itself), the position
>> of the parity and data blocks would change.  The "always consistent on
>> disk" approach of ZFS adds additional problems to this, which probably make
>> it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
>> it has to find some free space every time it wants to update a chunk to the
>> new parity level.
>> 
>> 
>>>> What you describe here is known as unionfs in Linux, among others.
>>>> I think there were RFEs or otherwise expressed desires to make that
>>>> in Solaris and later illumos (I did campaign for that sometime ago),
>>>> but AFAIK this was not yet done by anyone.
>>>> 
>>> YES, UnionFS-like functionality is what I was talking about. It seems
>>> like it has been abandoned in favor of AuFS in the Linux and the BSD world.
>>> It seems to have functions that are a little overkill to use with zfs, such
>>> as copy-on-write. Perhaps a more simplistic implementation of it would be
>>> more suitable for zfs.
>>> 
>> You could create zfs filesystems for subfolders in your "dataset" from the
>> separate pools, and give them mountpoints that put them into the same
>> directory.  You would have to balance the data allocation between the pools
>> manually, though.
> 
> I know that works but I was talking about having files stored at different 
> (hardware) locations and yet being in the same ... folder, I guess you are 
> using MacOS :)
> 
>> 
>>> Perhaps a similar functionality can be established through an abstraction
>>> layer behind network shares.
>>> 
>>> In Windows this functionality is called 'disk pooling', btw.
>> 
>> In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
>>  Do you actually expect a large portion of your disks to go offline
>> suddenly?  I don't see a good way to handle this (good meaning there are no
>> missing files under the expected error conditions) that gets you more than
>> 50% of your raw storage capacity (mirrors across the boundary of what you
>> expect to go down together).  I doubt I would like the outcome of having
>> some software make arbitrary decisions of what real filesystem to put each
>> file on, and then having one filesystem fail, so if you really expect this,
>> you may be happier keeping the two pools separate and deciding where to put
>> stuff yourself (since if you are expecting a set of disks to fail, I expect
>> you would have some idea as to which ones it would be, for instance an
>> external enclosure).
>> 
>> If, on the other hand, you don't expect your hardware to drop an entire set
>> of disks for no good reason, making them into one large storage pool and
>> putting your filesystem in it will share your data transparently across all
>> disks without needing to set anything else up.
>> 
>> Tim
> It seems that ZFS is good at protecting data but when things do happen to go 
> south then ZFS seems to be pretty bad at handling the situation.

Eh? This comment makes no sense.

> The more hard drives that are used in a storage pool the higher the 
> likelihood will be that something goes wrong.

yep, more stuff means more stuff to break.

> 
> While I agree that it is not reasonable to expect that all files will still 
> be accessible if a large portion of the disks go offline at least it would be 
> great if whatever happens to be in the remaining drives would still be 
> accessible.
> 
> One way to achieve something in that direction would be to create some 
> kind of a separation in the file system so that say two vdev configurations 
> are technically independent but together constitute a common unified storage 
> location. It would be like cells in a ship; even if a few cells break and 
> take in water, the ship won't sink because the other cells are intact.

We call that RAID :-)
 -- richard

-- 

ZFS storage and performance consulting at http://www.RichardElling.com

_______________________________________________
OpenIndiana-discuss mailing list
[email protected]
http://openindiana.org/mailman/listinfo/openindiana-discuss
