On Oct 28, 2012, at 5:10 AM, Robin Axelsson <[email protected]> wrote:

> On 2012-10-24 21:58, Timothy Coalson wrote:
>> On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson <[email protected]> wrote:
>>> It would be interesting to know how you convert a raidz2 stripe to say a
>>> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
>>> parity drive by converting it to a raidz3 pool. I'm imagining that would
>>> be like creating a raidz1 pool on top of the leaf vdevs that constitutes
>>> the raidz2 pool and the new leaf vdev which results in an additional parity
>>> drive. It doesn't sound too difficult to do that. Actually, this way you
>>> could even get raidz4 or raidz5 pools. Question is though, how things would
>>> pan out performance wise, I would imagine that a 55 drive raidz25 pool is
>>> really taxing on the CPU.
>>
>> Multiple parity is more complicated than that, an additional xor device (a
>> la traditional raid4) would end up with zeros everywhere, and couldn't
>> reconstruct your data from an additional failure. Look at "computing
>> parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 . While in theory it
>> can extend to more than 3 parity blocks, it is unclear whether more than 3
>> will offer any serious additional benefits (using multiple raidz2 vdevs can
>> give you better IOPS than larger raidz3 vdevs, with little change in raw
>> space efficiency). There are also combinatorial implications to multiple
>> bit errors in a single data chunk with high parity levels, but that is
>> somewhat unlikely.
>
> XOR you say? I didn't know that raidz used xor for parity. I thought they
> used some kind of a Reed-Solomon implementation à la PAR2 on the block level
> to achieve "RAID like" functionality. It never was stated from what I could
> read in the documentation that the raid functionality was implemented like
> traditional hardware RAID. If xor is the case then I'm curious as to how they
> managed to pull off a raidz3 implementation with three disk redundancy.
The first parity is XOR (also a Reed-Solomon syndrome). The 2nd and 3rd
parity are other syndromes.

Also, minor nit: there is no such thing as hardware RAID, there is only
software RAID.
 -- richard

> Maybe a good read into the zpool source code would help clarifying things...
>
>>> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
>>> no-brainer; you just remove one drive from the pool and force zpool to
>>> accept the new state as "normal".
>>
>> A degraded raidz2 vdev has to compute the missing block from parity on
>> nearly every read, this is not the normal state of raidz1. Changing the
>> parity level, either up or down, has similar complications in the on-disk
>> structure.
>>
>>> But expanding a raidz pool with additional storage while preserving the
>>> parity structure sounds a little bit trickier. I don't think I have that
>>> knowledge to write a bpr rewriter although I'm reading Solaris Internals
>>> right now ;)
>>
>> Unless raidz* did something radically different than raid5/6 (as in, not
>> having the parity blocks necessarily next to each other in the data chunk,
>> and having their positions recorded in the data chunk itself), the position
>> of the parity and data blocks would change. The "always consistent on
>> disk" approach of ZFS adds additional problems to this, which probably make
>> it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
>> it has to find some free space every time it wants to update a chunk to the
>> new parity level.
>>
>>>> What you describe here is known as unionfs in Linux, among others.
>>>> I think there were RFEs or otherwise expressed desires to make that
>>>> in Solaris and later illumos (I did campaign for that sometime ago),
>>>> but AFAIK this was not yet done by anyone.
>>>
>>> YES, UnionFS-like functionality is what I was talking about. It seems
>>> like it has been abandoned in favor of AuFS in the Linux and the BSD world.
>>> It seems to have functions that are a little overkill to use with zfs, such
>>> as copy-on-write. Perhaps a more simplistic implementation of it would be
>>> more suitable for zfs.
>>
>> You could create zfs filesystems for subfolders in your "dataset" from the
>> separate pools, and give them mountpoints that put them into the same
>> directory. You would have to balance the data allocation between the pools
>> manually, though.
>
> I know that works but I was talking about having files stored at different
> (hardware) locations and yet being in the same ... folder, I guess you are
> using MacOS :)
>
>>> Perhaps a similar functionality can be established through an abstraction
>>> layer behind network shares.
>>>
>>> In Windows this functionality is called 'disk pooling', btw.
>>
>> In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
>> Do you actually expect a large portion of your disks to go offline
>> suddenly? I don't see a good way to handle this (good meaning there are no
>> missing files under the expected error conditions) that gets you more than
>> 50% of your raw storage capacity (mirrors across the boundary of what you
>> expect to go down together). I doubt I would like the outcome of having
>> some software make arbitrary decisions of what real filesystem to put each
>> file on, and then having one filesystem fail, so if you really expect this,
>> you may be happier keeping the two pools separate and deciding where to put
>> stuff yourself (since if you are expecting a set of disks to fail, I expect
>> you would have some idea as to which ones it would be, for instance an
>> external enclosure).
>>
>> If, on the other hand, you don't expect your hardware to drop an entire set
>> of disks for no good reason, making them into one large storage pool and
>> putting your filesystem in it will share your data transparently across all
>> disks without needing to set anything else up.
>>
>> Tim
>
> It seems that ZFS is good at protecting data but when things do happen to go
> south then ZFS seems to be pretty bad at handling the situation.

Eh? This comment makes no sense.

> The more hard drives that are used in a storage pool the higher the
> likelihood will be that something goes wrong.

yep, more stuff means more stuff to break.

> While I agree that it is not reasonable to expect that all files will still
> be accessible if a large portion of the disks go offline at least it would be
> great if whatever happens to be in the remaining drives would still be
> accessible.
>
> One way to achieve something along that direction would be to create some
> kind of a separation in the file system so that say two vdev configurations
> are technically independent but together constitutes a common unified storage
> location. It would be like cells in a ship; even if a few cells break and
> take in water, the ship won't sink because the other cells are intact.

We call that RAID :-)
 -- richard

--
ZFS storage and performance consulting at http://www.RichardElling.com
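
Editor's note on the parity exchange earlier in this message: the sketch below tries to illustrate why a second plain XOR device adds nothing, while a second, different syndrome lets you rebuild two lost blocks. It assumes RAID-6 style arithmetic in GF(2^8) with the polynomial 0x11d and generator 2, and it is not taken from the ZFS source. The names parity_p, parity_q and recover_two are made up for illustration, and only two parities are shown, not the three of raidz3.

```python
# Log/antilog tables for GF(2^8), polynomial 0x11d, generator 2.
# (Assumed field parameters; a common RAID-6 choice, not read from ZFS code.)
EXP = [0] * 510
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
for power in range(255, 510):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Divide a by a non-zero field element b."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 255]


def parity_p(blocks):
    """First syndrome: byte-wise XOR of all blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def parity_q(blocks):
    """Second syndrome: XOR of g**index * block, with g = 2."""
    out = bytearray(len(blocks[0]))
    for index, block in enumerate(blocks):
        coeff = EXP[index]
        for i, byte in enumerate(block):
            out[i] ^= gf_mul(coeff, byte)
    return bytes(out)


def recover_two(blocks, x, y, p, q):
    """Rebuild data blocks x and y (passed as None in blocks) from P and Q."""
    # Zero blocks contribute nothing to either syndrome.
    survivors = [b if b is not None else bytes(len(p)) for b in blocks]
    pxy = bytes(a ^ b for a, b in zip(p, parity_p(survivors)))  # Dx ^ Dy
    qxy = bytes(a ^ b for a, b in zip(q, parity_q(survivors)))  # g^x*Dx ^ g^y*Dy
    gx, gy = EXP[x], EXP[y]
    denom = gx ^ gy
    dx = bytes(gf_div(qv ^ gf_mul(gy, pv), denom) for pv, qv in zip(pxy, qxy))
    dy = bytes(a ^ b for a, b in zip(pxy, dx))
    return dx, dy


data = [b"zfs!", b"raid", b"pool", b"vdev"]
p = parity_p(data)
q = parity_q(data)

# A plain XOR across the data blocks plus P is all zeros: no new information.
second_xor = parity_p(data + [p])

# Lose two data blocks and rebuild them from the two different syndromes.
damaged = list(data)
damaged[1] = None
damaged[3] = None
dx, dy = recover_two(damaged, 1, 3, p, q)
print(second_xor, dx, dy)
```

The design point is only that each extra parity has to be an independent equation over the data; weighting block i by a distinct power of the generator is one way to get that.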
