On Monday November 27, [EMAIL PROTECTED] wrote:
> using reiserfs over raid5 with 5 disks. This is unnecessarily
> suboptimal, it should be that parity writes are 20% of the disk
> bandwidth. Comments?
>
> Is there a known reason why reiserfs over raid5 is way worse than
> ext2? Does ext2 optimize for raid5 in some way?
>
> Hans
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
Having read the 6 or so followups to this question so far, I see there
is a reasonable mixture of information and misinformation floating
around.
As the author of some of the software raid code in 2.4, and as one
who has looked at it deeply and believes he understands it all, let
me try to present some useful facts.
1/ raid5 maintains a stripe cache, currently containing 128 stripes,
each as wide as one filesystem block (not one raid5 chunk!).
When a read or write request arrives, it is attached to the
appropriate stripe. If the needed stripe is not in the cache, a
free one is allocated. If there are no free stripes, 1/8 of the
cache is freed up.
2/ raid5 uses one of two methods to write out data and update the
parity for a stripe, rmw or rcw.
rmw: The Read Modify Write method reads the old parity block, and
the old data blocks for any block that needs to be written,
calculates the correct parity by subtracting the old data
blocks, and adding the new data blocks to the parity block, and
then writing out the new data and new parity. Note that some
old data may already be in the cache, and so will not need to be
read.
rcw: The ReConstruct Write method reads any data for blocks in the
stripe that do NOT need to be written, and then calculates the
correct parity from all current data (whether old or new) and
writes out the new data and the new parity.
It chooses between these two methods based on how many blocks need
to be pre-read before the parity calculation can be made.
3/ In 2.4 (but not in 2.2) access to the stripe cache is overly single
threaded in the sense that only one request can ever be attached to
a given stripe at a time. I gather that this was because the
author was faced with the need to make the code SMP safe (which was
not an issue in 2.2 due to the presence of the BKL) and did not have
a lot of time to work on it. The resulting solution works, but is not
optimal. Blessed with more time, I have a patch which relaxes this
restriction
and substantially improves throughput. See
http://www.cse.unsw.edu.au/~neilb/patches/linux/
It could be precisely this issue that Hans is seeing (if he is
using 2.4; he didn't say).
4/ As someone mentioned, the only optimisation ext2 has for raid is
the "-R stride" option.
This tells ext2 the stride size of the raid array, meaning the
minimum stretch of virtual addresses that will span every device in
the array. This is (the number of drives minus 1) times the chunk
size. I believe that mkfs.ext2 wants this in units of the
filesystem-block-size, but I'm not sure.
The effect of this is to position frequently accessed metadata,
such as allocation bitmaps, at different offsets into the stride, so
that it will get allocated evenly over all the different devices in
the array, so that there is no 'hot-drive' for metadata access.
5/ The best way to write to a raid5 array is by writing large
contiguous stretches of data. This should allow the raid5 array to
collect blocks into stripes and write a full stripe at a time,
which will not require any pre-reading. However it is not easy to
do this in Linux.
The "Best" interface for efficiently writing data is to have an
asynchronous write, and a synchronous flush.
The "write" says "write this whenever you like", and the "flush"
says "Ok, I want this written NOW".
Actually, it is possibly even better to have three stages:
1/ this is ready to be written
2/ write this now please
3/ don't return until this is written.
The Unix system call interface has "write" with a very broad
"flush" - you can flush a whole file, but I don't think that you
can flush individual byte ranges (I could be wrong). It does
fairly well.
The NFS network filesystem, in version two, only has synchronous
writes. This makes writing a real bottleneck. In version three,
asynchronous writes were introduced together with a "COMMIT" which
would flush a given byte range. This makes writing of large files
much more efficient.
The Linux block device layer only has one flavour of write
request. It is not exactly a synchronous write, as the caller can
choose to wait or not. But the device driver has (almost) no way
of knowing whether the caller is waiting or not.
This makes it hard to collect blocks together into a stripe. When
raid5 gets a write request, it really needs to start acting on it
straight away, which means scheduling a read of the parity block
and the old data. While it is waiting for the read to complete, it
may get some more writes attached to the stripe, and may even get a
full stripe, but it cannot continue until the reads complete, and
the reads may well have been wasted time. For larger numbers of
drives there may be less wastage, but there is still some.
I included an (almost) above. This is because there is a fairly
coarse method for drives to discover that a writer (or reader) is
now waiting for a response. This is called "plugging".
A device may choose to "plug" itself when it gets an I/O request.
This causes the request to be queued, but the queue doesn't get
processed, so subsequent requests can be merged on the queue.
When the device gets unplugged, this smaller number of merged
requests gets dealt with more efficiently.
However, there is only one "unplug-all-devices"(*) call in the API
that a reader or writer can make. It is not possible to unplug a
particular device, or better still, to unplug a particular request.
This works fairly well when doing I/O on a single device - say an
IDE drive - but when doing I/O on a raid array, which involves a
number of devices, there will be a lot of unplug-all-devices calls,
and plugging will not be so effective.
I have some patches which add plugging for raid5 writes, and it
DRAMATICALLY improves sequential write throughput on a 3 to 5 drive
raid5 array with a 4k chunk size. With other configurations there
is an improvement, but it is not so dramatic. There are various
reasons for this, but I believe that part of the reason is the
extra noise of unplug-all-devices calls. I haven't explored this
very thoroughly yet.
So, in short, you can do better than the current code (see my
patches) but the Linux block-device API gets in the way a bit.
In 2.2, a different approach was possible. As all filesystem data
was in the buffer cache, which was physically addressed, the raid5
code could, when preparing to write, look in the buffer cache for
other blocks in the same stripe which were marked dirty, and
proactively write them (even though no write request had been
issued). This improved performance substantially for 2.2 raid5
writing. However it is not possible in 2.4 because filesystem
data is, by and large, not in the buffer cache - it is in the page
cache.
(*) The unplug-all-devices call is spelt:
run_task_queue(&tq_disk);
6/ With reference to Hans' question in a follow-up:
Is the following statement correct? Unless we write the whole stripe
next to, instead of over, the current data, we cannot guarantee
recoverability upon removal of a disk drive while the FS is in
operation, and this is likely to be much of the motivation for the
NetApp WAFL design as they gather writes into stripes (I think this
last is true, but not sure).
RAID5 cannot survive an unclean shutdown with a failed drive. This
is because a stripe may have been partially written at the point of
unclean shutdown, so reconstructing the missing block from the
remaining drives will likely produce garbage.
However, apart from that, there are no problems with losing drives
while the FS is in operation.
There are (at least) two effective responses to this problem:
1/ use NVRAM somewhere in the system so that you can effectively do
a two stage commit - commit data to NVRAM, then write that
data to the array, and then release the data from NVRAM.
After an unclean shutdown, you re-write all data in NVRAM, and
you are safe.
Of course, the NVRAM could be replaced by any logging device, such
as a separate mirrored pair of drives, but there could be a
performance cost in that.
This is a part of the NETAPP solution I believe.
2/ Use a filesystem that
- knows about the raid stripe size, and
- only ever writes full stripes, and
- does so to stripes which didn't previously contain live
data, and
- knows which stripes it has written recently (even after an
unclean shutdown) and
- can tell if a stripe was written correctly or not.
Such a filesystem could, on restart, read and re-write all stripes
which could have been written recently (since last sync), and so
ensure correct parity for all valid data. A log structured
filesystem is ideal for this task, and writing one is on my todo
list - though it is a rather large item :-)
My understanding of NETAPP's WAFL is definitely incomplete, but I
don't believe that they guarantee to always do stripe wide writes,
though they certainly try to encourage it.
I hope this helps.
NeilBrown