Re: [Beowulf] Big storage

Jeffrey B. Layton Mon, 27 Aug 2007 14:21:22 -0700

Bruce,

IMHO the fundamental problem is not necessarily the bad sectors
that happen from time to time, although you have to have some
way of recovering the data (I don't know much about specific
RAID cards and what they do, but I'm pretty sure that a number
of storage vendors don't scan for bad sectors at any time). I don't
believe this is necessarily the point.


I think the point is that if a RAID array has a bad disk (for
whatever reason) then the array has to be reconstructed from the
remaining data and parity. During this reconstruction process,
the probability of encountering a read error is high. The probability
depends upon the number of disks, and the URE rate.

If you have a RAID-5 volume (N disks) and you are rebuilding and
hit a read error, the reconstruction stops and you have to restore from
backup. If you have a RAID-6 volume (N disks) and one disk has
failed (N-1) and you are reconstructing, then the reconstruction can
continue because you have the ability to tolerate two failed disks.
I'm not really sure what happens if during the reconstruction with
N-1 disks, it hits a read error. It may reconstruct the bad block from the
remaining N-1 drives or it may just mark the drive as down and recover
the block from the remaining N-2 disks.

In general you are vulnerable during the reconstruction period. If
you have a RAID-5 volume, lose a disk and start reconstruction, you
have a period of time where if you lose another disk you will lose
all the data on the volume. You could also consider hitting a read
error during reconstruction as a "failure". How long this period of
time is, is fairly important. If you can reconstruct during this time
period, you are fine (if you have enough disks for a spare or you
can put a disk in to act as a spare).
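
To put a rough number on that window, here is a minimal sketch. It assumes independent disks with exponentially distributed lifetimes, and the MTBF, disk count, and rebuild times are hypothetical values picked for illustration, not figures from this thread or from any vendor.

```python
import math


def p_second_failure(remaining_disks: int, rebuild_hours: float, mtbf_hours: float) -> float:
    """Chance that at least one remaining disk fails before the rebuild finishes.

    Assumes independent disks with exponentially distributed lifetimes.
    """
    return -math.expm1(-remaining_disks * rebuild_hours / mtbf_hours)


# Hypothetical inputs, not taken from the thread.
REMAINING_DISKS = 10
MTBF_HOURS = 500_000.0

for rebuild_hours in (6.0, 24.0, 72.0):
    p = p_second_failure(REMAINING_DISKS, rebuild_hours, MTBF_HOURS)
    print(f"{rebuild_hours:5.0f} h rebuild: P(second disk fails) = {p:.5%}")
```

Under this model the risk grows almost linearly with the rebuild time, which is why a longer reconstruction directly widens the window.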

If you have a RAID-6 volume, lose a disk and start reconstruction,
you also have a period of time where you are vulnerable. The problem
with RAID-6 is that it takes more work to reconstruct the data. So
while you have some extra protection from the second disk, it takes
longer to reconstruct the data. I don't know the reconstruction times
of RAID-5 vs. RAID-6 unfortunately. So this window may be larger
or smaller than the RAID-5 window. I'm guessing that it's smaller,
but I don't know for sure.

I think there are several important points here.

1. The sectors on disk need to be scanned continually to find bad
sectors (to have them remapped and have the data on the sectors
rebuilt).

2. If you have a RAID controller and a RAID-5 volume and lose a
disk and then hit a read error, the volume is failed and you have to
restore the volume from backup. As disks get bigger it could take
a long time to do this.

3. If you have a RAID controller and a RAID-6 volume and lose
a disk, then you can reconstruct. I'm not sure what a read error
does on the remaining N-1 disks, you might or might not have
problems.

So it's reconstruction that is a concern.

Jeff

Jeff,
I did read Garth's comments. I believe that there are two types of possible problems:
(1) A sector or handful of sectors on a disk become unreadable
(2) An entire disk fails (all sectors become unreadable)
Problems of type (1) can be handled well by high quality RAID implementations. They are not serious, in principle, because the necessary redundant data for those few blocks exists elsewhere on the array, and is statistically very unlikely to also be unreadable. Also, a high-quality implementation regularly scans disks looking for uncorrectable blocks, so that these can be rewritten from redundant data. A high-quality RAID-6 implementation can also handle failures of type (1) on the redundant disks, even when rebuilding one of two failed disks. More serious is the problem of having two failed disks (2) and THEN encountering unreadable sectors on the remaining disks.
In short, as I see it, the real issue is with failed disks, not with unreadable sectors. Unreadable sectors are unlikely to happen at the same LBAs on two disks, unless the entire disk has failed. So the right question is (for RAID-6) what is the probability of two failed disks within the rebuild time window, and how likely is it that uncorrectable sectors have appeared during that time?
Cheers,
    Bruce


On Fri, 24 Aug 2007, Jeffrey B. Layton wrote:
Bruce Allen wrote:
Hi Jeff,
OK, I see the point. You are not worried about multiple unreadable sectors making it impossible to reconstruct lost data. You are worried about 'whole disk' failure.
Well, no actually. I'm worried about unrecoverable reads on the
remaining disks during reconstruction. :) Is that what you are referring
to?
I definitely agree that this is a possible problem. In fact we operate all of our UWM data archives (about 300 TB) as RAID-6 to reduce the probability of this. The idea of a second disk failing in a RAID-5 array during rebuild does not make for a good night's sleep!
Did you see Garth's comments? Even using a number of 500GB drives
greatly increases the probability of a URE during reconstruction. RAID-6
helps you sleep, but not as much as you think :) Scares the cr** out of me.
I'm looking to build a home server and I think I'm going to do RAID-61
to give myself some extra protection. I just have to figure out how to power
all of them and find a case where they can fit and a motherboard with enough
SATA connectors :)

Enjoy!

Jeff
Cheers,
    Bruce

On Fri, 24 Aug 2007, Jeffrey B. Layton wrote:
Bruce,

I urge you to read Garth's comments. Your description of what
RAID controllers do is very good when there are no failed drives.
If a drive fails though, you can't scan the disks looking for bad
sectors.

During a reconstruction, the RAID controller is reconstructing
the data based on the remaining drives and the parity.
Unfortunately, the controller is likely to be block based so it has
to rebuild every block of the failed disk. But if the controller is
doing a reconstruction and hits a URE, then the reconstruction
process just stops and the controller cries uncle. This means you
have to restore the failed array from a backup. This means the
entire volume.

With drives getting larger and larger all the time, the window of
vulnerability during reconstruction (where a second drive failure
will fail the entire volume) has grown because it takes longer and
longer to reconstruct so much data. This is why people are moving
to RAID-6. But RAID-6 is expensive in terms of capacity and performance
(Note: it has worse write performance than RAID-5). It gives the
ability to tolerate a second drive failure, but it may not reduce the
window of vulnerability during reconstruction because it takes longer
to reconstruct.

Here's an article where Garth talks about this (it's at the end):

http://www.eweek.com/article2/0,1895,2168821,00.asp

I wanted to note one quick thing from the article:

"The probability of the disk failing to read back data is the same as
it was long ago, so today you can expect at least one failed read every
10TB to 100TB. But the reconstruction of a failed 500GB disk in an
11-disk array has to read 5TB, so there can be an unacceptably large
chance of failure to rebuild every one of the 1 billion sectors on the
failed disk."
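
Here is a rough way to turn the numbers in that quote into a probability. It is only a sketch: it assumes read errors are independent and average one per 10TB (or one per 100TB) read, which is my reading of the quote, not a measured rate. The function name is made up for illustration.

```python
import math

TB = 1e12  # decimal terabyte, in bytes


def p_read_error(bytes_read: float, bytes_per_error: float) -> float:
    """Chance of at least one unrecoverable read error while reading bytes_read.

    Assumes independent errors that average one per bytes_per_error bytes read.
    """
    per_byte = 1.0 / bytes_per_error
    # 1 - (1 - per_byte) ** bytes_read, computed without losing precision.
    return -math.expm1(bytes_read * math.log1p(-per_byte))


# Numbers from the quote: 11-disk array of 500GB disks, one disk failed.
disk_bytes = 0.5 * TB
surviving_disks = 10
rebuild_read = surviving_disks * disk_bytes  # 5TB must be read back

for interval in (10 * TB, 100 * TB):
    p = p_read_error(rebuild_read, interval)
    print(f"one error per {interval / TB:.0f}TB: P(rebuild hits a URE) = {p:.1%}")
```

With these assumptions the rebuild hits a read error about 39% of the time at one error per 10TB, and about 5% of the time at one error per 100TB.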

So if a reconstruction fails, you have to copy 5TB of data from the
backup to the volume. If you do this from tape - you're going to wait
a long time. You can do it from a disk backup but it still may take
some time to move 5TB across the wire depending upon how you
have everything connected.

Jeff
Hi Jeff,
For this reason, in a RAID system with a lot of disks it is important to scan the disks looking for unreadable (UNC = uncorrectable) data blocks on a regular basis. If these are found, then the missing data at that Logical Block Address (LBA) has to be reconstructed from the *other* disks and re-written onto the failed disk.
In a well-designed (hardware or software) RAID implementation, you can reconstruct the missing data by only reading a handful of logical blocks from the redundant disks. It is not necessary to read the entire disk surface just to get a few 512 byte sectors of data. So a failure for different data somewhere else on a disk should not (in principle) prevent reconstruction of the lost/missing data. In a poorly-designed RAID implementation, you have to read the ENTIRE disk surface to get data from a few sectors. In this case, another uncorrectable disk sector can be crippling.
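
A toy illustration of rebuilding a single block from the rest of its stripe, assuming simple single-parity (RAID-5 style) XOR. The 8-byte blocks and the helper name are made up for the example; real controllers work on 512 byte sectors or larger chunks and this is not how any particular card implements it.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus their parity block (toy 8-byte blocks).
data = [b"sector-A", b"sector-B", b"sector-C"]
parity = xor_blocks(data)

# Suppose the block at index 1 comes back unreadable. Only the other blocks
# in the same stripe are read; the rest of each disk is never touched.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[lost]
print(rebuilt)
```

The point is that the repair touches one block per surviving disk, so an unrelated bad sector elsewhere on those disks does not get in the way.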
Most good hardware RAID cards have an option for continuous disk scanning. For example ARECA calls this 'consistency checking'. It should be done on a regular basis.
You can use smartmontools to do this also, by carrying out regular read scans of the disk surface and then forcing a RAID consistency check/rebuild if there is a read failure at some disk block.
Note that continuous scanning is also needed for ECC memory to prevent correctable single-bit errors from becoming uncorrectable double-bit errors. In this RAM/memory context it is called 'memory scrubbing'.
Cheers,
    Bruce

On Thu, 23 Aug 2007, Jeffrey B. Layton wrote:
This isn't really directed at Jeff, but it seemed like a good segue
for a comment. Everyone - please read some recent articles by
Garth Gibson about large capacity disks and large number of
disks in a RAID group. Just to cut to the chase, given the
Unrecoverable Read Error (URE) rate and large disks, during
a rebuild you are almost guaranteed to hit a URE. When that
happens, the rebuild stops and you have to restore everything
from a backup. RAID-6 can help, but given enough disks and
large enough disks, the same thing can happen (plus RAID-6
rebuilds take longer since there are more computations involved).

Jeff

P.S. I guess I should disclose that my day job is at Panasas. But
regardless, I would recommend reading some of Garth's comments.
Maybe I can also get one of his presentations to pass around.

P.P.S. If you don't know Garth, he's one of the fathers of RAID.
Hello Jakob,
A couple of things...
1. ClusterFS has an easy to understand calculation on why raid 6 is
necessary for the amount of disks you're considering. You do need to
plan for multi-disk failure, especially with the rebuild time of 1TB
disks.
http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-10-1.html#wp1037512
2. Avoid tape if you can. At this scale, the administrative time and
costs far outweigh the benefits. Of course if you need to move your
data to a secure vault that's another thing. If you really want to do
tape, some people choose to do disk > disk > tape. This eliminates the
read interrupts on the primary storage and provides some added
redundancy.
3. We do use Nexsan's satabeasts for storage similar to this. Without
commenting on costs, the jackrabbit is technologically superior.

Thanks,
                jeff

On 8/23/07, Jakob Oestergaard <[EMAIL PROTECTED]> wrote:
On Thu, Aug 23, 2007 at 07:56:15AM -0400, Joe Landman wrote:
Greetings Jakob:
Hi Joe,

Thanks for answering!

...
up front disclaimer: we design/build/market/support such things.
That does not disqualify you  :)
I'm looking at getting some big storage. Of all the parameters, getting as low dollars/(month*GB) as possible is by far the most important. The price of acquiring and
maintaining the storage solution is the number one concern.
Should I presume density, reliability, and performance also factor in
somewhere as 2,3,4 (somehow) on the concern list?
I expect that the major components of the total cost of running this beast will
be something like

   acquisition
 + power
 + cooling
 + payroll (disk-replacing admins :)
Real-estate is a concern as well, of course. The rent isn't free. It would be nice to pack this in as few racks as possible. Reliability, well... I expect frequent drive failures, and I would expect that we'd run some form of RAID to mitigate this. If the rest of the hardware is just reasonably well designed, the most frequently failing components should be redundant and hot-swap
replaceable (fans and PSUs).
It's acceptable that a head-node fails for a short period of time. The entire system will not depend on all head nodes functioning simultaneously.
The setup will probably have a number of "head nodes" which receive a large amount of data over standard gigabit from a large amount of remote sources. Data is read infrequently from the head nodes by remote systems. The primary
load on the system will be data writes.
Ok, so you are write dominated. Could you describe (guesses are fine) what the writes will look like? Large sequential data, small random
data (seek, write, close)?
I would expect something like 100-1000 simultaneous streaming writes to just as many files (one file per writer). The files will be everything from a few
hundred MiB to many GiB.
I guess that on most filesystems these streaming sequential writes will result in something close to "random writes" to the block layer. However, we can be
very generous with write buffering.
The head nodes need not see the same unified storage; so I am not required to have one big shared filesystem. If beneficial, each of the head nodes could
have their own local storage.
There are some interesting designs with a variety of systems, including GFS/Lustre/... on those head nodes, and a big pool of drives behind them. These designs will add to the overall cost, and increase complexity.
Simple is nice :)
The storage pool will start out at around 100TiB and will grow to ~1PiB within a year or two (too early to tell). It would be nice to use as few racks as
possible, and as little power as possible  :)
Ok, so density and power are important. This is good. Coupled with the low management cost and low acquisition cost, we have about 3/4 of what
we need.  Just need a little more description of the writes.
I hope the above helped.
Also, do you intend to back this up?
That is a *very* good question.
How important is resiliency of the
system? Can you tolerate a failed unit (assume the units have hot
spares, RAID6, etc).
Yes. Single head nodes may fail. They must be fairly quick to get back on line (having a replacement box I would expect no more than an hour of downtime).
When you look at storage of this size, you have to
start planning for the eventual (and likely) failure of a chassis (or
some number of them), and think about a RAIN configuration.
Yep. I don't know how likely a "many-disk" failure would be... If I have a full replacement chassis, I would guess that I could simply pull out all the disks from a failed system, move them to the replacement chassis and be up and
running again in "short" time.
If a PSU decides to fry everything connected to it including the disks, then
yes, I can see the point in RAIN or a full backup.
It's a business decision if a full node loss would be acceptable. I honestly don't know that, but it is definitely interesting to consider both "yes" and
"no".
Either
that, or invest into massive low level redundancy (which should be scope
limited to the box it is on anyway).
Yes; I had something like RAID-5 or so in mind on the nodes.
It *might* be possible to offload older files to tape; does anyone have experience with HSM on Linux? Does it work? Could it be worthwhile to
investigate?
Hmmm... First I would suggest avoiding tape, you should likely be looking at disk to disk for backup, and use slower nearline mechanisms.
Why would you avoid tape?
Let's say there was software which allowed me to offload data to tape in a reasonable manner. Considering the running costs of disk versus tape, tape
would win hands down on power, cooling and replacements.
Sure, the random seek time of a tape library sucks golf balls through a garden hose, but assuming that one could live with that, are there more important
reasons to avoid tape?
One setup I was looking at, is simply using SunFire X4500 systems (you can put 48 standard 3.5" SATA drives in each 4U system). Assuming I can buy them with 1T SATA drives shortly, I could start out with 3 systems (12U) and grow the entire setup to 1P with 22 systems in little over two full racks.
Any better ideas? Is there a way to get this more dense without paying an arm
and a leg?  Has anyone tried something like this with HSM?
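
For reference, the sizing arithmetic above can be checked with a short sketch. The 42U rack height is an assumption (the post only says "racks"), and the capacities are raw, before any RAID or filesystem overhead.

```python
DRIVES_PER_SYSTEM = 48   # from the post: 48 drives per X4500
DRIVE_TB = 1             # from the post: 1T SATA drives
UNITS_PER_SYSTEM = 4     # from the post: 4U per system
RACK_UNITS = 42          # assumption: standard 42U rack


def raw_tb(systems: int) -> int:
    """Raw capacity in TB, ignoring RAID and filesystem overhead."""
    return systems * DRIVES_PER_SYSTEM * DRIVE_TB


def rack_units(systems: int) -> int:
    """Rack space in U taken up by the systems alone."""
    return systems * UNITS_PER_SYSTEM


for systems in (3, 22):
    units = rack_units(systems)
    print(f"{systems:2d} systems: {raw_tb(systems)} TB raw, "
          f"{units}U = {units / RACK_UNITS:.2f} racks")
```

That gives 144 TB raw in 12U for the starting point and 1056 TB raw in 88U for the full build, which is a little over two racks at 42U each. Usable space will be lower once RAID parity and spares are taken out.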
Yes, but I don't want to turn this into a commercial, so I will be succinct. Scalable Informatics (my company) has a similar product, which does have a good price and price per gigabyte, while providing excellent performance. Details (white paper, benchmarks, presentations)
at the http://jackrabbit.scalableinformatics.com web site.
Yep, I was just looking at that actually.
The hardware looks similar in concept to the SunFire, but as I see it you guys
have thought about a number of services atop of that (RAIN etc.)


Very interesting!

--

 / jakob


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
