J.C. Roberts wrote:
> On Friday 20 April 2007 08:32, Tony Abernethy wrote:
>> Jason Beaudoin wrote:
>> > <snip>
>> >
>> > > Use all the tricks you can for YOUR solution, including:
>> > > * lots of "small" partitions
>> >
>> > What are the reasonings behind this?
>> >
>> > Thanks for the awesome post!
>>
>> I think it runs something like this
>> If there is a problem somewhere on the disk,
>> if it's all one big partition, you must fix the big partition
>> if it's lots of small partitions, you fix the one with the problem.
>>
>> Even worse, in some situations,
>> the difference is between being dead and being somewhat crippled.
>>
>> Methinks there's lots of hard-won experience behind Nick's answers ;)
yeah, though fortunately most of it was in the form of confirmation of
already held paranoia. :)
> Your last assumption is the most correct, and Nick has put some of that
> experience into FAQ-14 for our reading pleasure.
In addition to Tony and J.C.'s comments (I've edited them out for size,
go back and read 'em if you haven't), let me add another really big
reason: Growth and scalability.
Usual logic goes something like this: "I need a lot of space, so I'm
going to build a file system that has a lot of space in it", and you
drop all that space into one file system. Efficient? For a while,
yes. BUT, what about when it fills up?
Usual response: "use a Volume Manager" or "Dump the data to a new,
bigger disk system". Ok, the ability of some "volume managers" to
dynamically increase the size of a file system is kinda cool, but I
would argue that for many apps, it is just another way of saying,
"The initial design SUCKED and I had more money than brains to fix
the problem" (assuming one of the commercial products, of course).
Somewhat of an oversimplification, of course...but...
Dumping the data from one disk to another is fine and dandy when you
are talking about the 40G disk in your home or desktop computer,
where being down for a few hours is no big deal. But what about a
server? I don't care how fast your disks are, moving 300G of data
to a new disk system is a lot of slow work.
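For the curious, here's the back-of-the-envelope version in Python.
The 30 MB/s sustained figure is just an assumption for illustration;
a seek-heavy pile of small files will do far worse:

```python
# Rough copy-time estimate. The throughput number is an assumed
# best case, not a measurement from the system described here.

def transfer_hours(data_gb, mb_per_sec):
    """Hours to copy data_gb gigabytes at mb_per_sec MB/s sustained."""
    seconds = (data_gb * 1024) / mb_per_sec
    return seconds / 3600

# The 40G home disk: well under an hour even at a modest rate.
print(round(transfer_hours(40, 30), 1))

# The 300G server: hours of downtime even in the ideal case.
print(round(transfer_hours(300, 30), 1))
```

And that's before fsck, permissions checks, and the inevitable
"why is this one directory taking forever" surprises.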
Here's a better idea: break your data into more manageable chunks,
and design the system to fill those chunks AND make it easy to add
more later. So, you implement today with 1TB of data space, broken
up into two 500G chunks. Fill the first one, move on to the second
one. Fill the second one, you bolt on more storage -- a process
which will probably take minutes, not hours. And when you do bolt
on more storage, you will be doing it in the future, when capacities
are bigger and costs are lower.
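A sketch of that "fill one chunk, roll to the next" logic in Python.
The partition names and the free-space threshold here are made up
for illustration, not taken from the real system:

```python
# Minimal sketch: pick the first archive chunk that still has room.
# When nothing qualifies, it is time to bolt on more storage.

def pick_active_chunk(chunks, min_free_bytes):
    """chunks is a list of (mountpoint, free_bytes) pairs, in fill
    order. Returns the first with enough room, or None."""
    for mountpoint, free in chunks:
        if free >= min_free_bytes:
            return mountpoint
    return None

chunks = [
    ("/archive/a03", 2 * 10**9),    # nearly full, skipped
    ("/archive/a04", 450 * 10**9),  # plenty of room
]
print(pick_active_chunk(chunks, 10 * 10**9))  # -> /archive/a04
```

In real life you'd feed it the free-space numbers from df or
statfs, but the decision itself really is this simple.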
Let's look at the machine I mentioned yesterday, our e-mail archive
system:
disks:
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/wd0a 199358 49742 139650 26% /
/dev/wd0e 1030550 6 979018 0% /tmp
/dev/wd0d 4126462 2877500 1042640 73% /usr
/dev/wd0f 1030550 197942 781082 20% /var
/dev/wd0h 3093790 1989324 949778 68% /home
/dev/wd0g 154803456 18471934 128591350 13% /archive
/dev/wd2e 288411108 262462948 11527606 96% /archive/a03
/dev/wd2f 288408068 264898440 9089226 97% /archive/a04
/dev/wd3e 480678832 442797322 13847570 97% /archive/a05
/dev/wd3f 480675792 440723042 15918962 97% /archive/a06
/dev/wd4e 480678832 439989958 16654934 96% /archive/a07
/dev/wd4f 480675792 443581618 13060386 97% /archive/a08
/dev/wd1e 480678840 19931182 436713716 4% /archive/a09
/dev/wd1f 480678368 2 456644448 0% /archive/a10
Look that over carefully, you can almost see the story of the
machine's design.
wd0 is a mirrored pair of 300G SATA drives (Accusys 75160). Note
that only a little more than half the drive is allocated at this
time! Why? Because there's no reason to wait for an fsck on 300G
when 160G is plenty. And besides, I may have guessed wrong in how
big I made /var or /tmp or ...
wd2 is a RAID5 set of 300G SATA drives (Accusys 76130). Why?
Because it was the biggest bang for the buck at the time, split down
the middle for manageability.
wd1 also started out as 300G drives, but has since been replaced by
the now cheaper-per-gig 500G drives. It only started being used a
couple of days ago.
wd3 and wd4 are also 1TB arrays, each made up of three 500G drives.
They were purchased when the original 300G drives were getting full.
Funny how that works, the 500G drives we just purchased (a09 and a10)
cost less than the 300G drives we installed originally. Delaying
purchasing storage until you need it is a good thing!
The suspiciously missing a01 and a02 partitions are now sitting on
a shelf, as they have been removed from the system. It is
relatively unlikely that we will need to go back to those, but we
hang on to 'em, Just In Case (and it is cheaper to hang onto three
300G SATA drives now than it is to restore from DVD if we were to
need to). Granted, in five years, those drives may not spin up,
nor may we be able to find anything to plug them into ("SATA?
Wow...I think I remember SATA...wasn't that a religious holiday?")
Hopefully, either the next block of storage that we get or the one
after that could use FFS2 and be made up of clusters of 750G or 1T
disks. But if it can't be, that's ok too (and I'd probably stick
to 500G chunks anyway, that seems to be a nice size, it would just
be handy to be able to partition a DISK bigger than 1T for this app).
You see that new storage can be added as it is needed, so I can
take full advantage of the dropping price of storage. I don't have
to buy way too much storage in advance. Also, we get to test the
automatic "rollover" mechanism every few months, which is better
than testing it every year or less often (as it is, I am embarrassed
by how many roll-overs were bobbled on this system because the new
partitions didn't have the ownership set right. Fortunately the
rest of this system is designed to handle that well).
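A pre-flight check along these lines would have saved me some of
that embarrassment. A minimal Python sketch; the expected uid/gid
are whatever your archiver runs as (values here are hypothetical):

```python
import os
import stat

def partition_ready(mountpoint, want_uid, want_gid):
    """True if the mountpoint exists, is a directory, and is owned
    by the expected uid/gid -- the check our bobbled roll-overs
    needed before the archiver started writing."""
    try:
        st = os.stat(mountpoint)
    except FileNotFoundError:
        return False
    return (stat.S_ISDIR(st.st_mode)
            and st.st_uid == want_uid
            and st.st_gid == want_gid)
```

Run it against the new partition before the roll-over date, not
after, and the rest of the system never has to find out.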
Our goal here is to store seven years of data! Think about the math
on that...at the time I installed this system, we were archiving
4GB/day. Assuming 250 business days a year, that was seven TB of
storage. Most of it would have sat idle, and by the time the last
TB was being filled, we'd have been able to buy 10TB disks at the
local office supply store. Well, now a year and a half later, we
are over 9GB/day, so our original math was dead wrong: we would
have managed to buy both TOO MUCH and TOO LITTLE storage up front,
at the same time, and paid way too much...and half way
through the project's archive life, the storage would look
pathetically low tech.
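The arithmetic above, worked out in Python (the 250 business
days/year figure is from the original estimate):

```python
# Seven-year archive capacity at a given daily rate, in TB
# (using 1000 GB/TB, as the estimates in the text do).

def seven_year_tb(gb_per_day, days_per_year=250, years=7):
    return gb_per_day * days_per_year * years / 1000

print(seven_year_tb(4))  # original estimate: 7.0 TB
print(seven_year_tb(9))  # at the actual rate: 15.75 TB
```

More than double the original estimate, eighteen months in. That
is the kind of error bar you should expect on any multi-year
storage projection.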
With this system, we can add storage as needed, plus we can take
old storage off-line when its need becomes relatively unlikely.
If you need a lot of storage, odds are, some day in the future,
you will need a lot MORE storage. You had best have some idea
how that will be handled.
OK, that covered scalability and growth. A note on the recovery
functionality:
We had a data loss event on this thing: a drive failed in such a
way that it put a dead short across the power supply, blowing out
both the RAID box and a power supply in the machine (and the power
supply in ANOTHER machine that I tested the failed drive in!).
I.e., a catastrophic failure, but an entirely imaginable one, and
something I always warn people about: a drive failure that takes
out your entire RAID system. Technical failure.
When I replaced the RAID box, I missed that it was incorrectly
jumpered, and ended up rebuilding the array with the replaced disk
as RAID0 rather than RAID5, so all the data was lost...ON THAT
MODULE. To be fair, I feel I should point out that the Accusys
box spent a lot of time trying to keep me from doing something
stupid to my data, and it did know I was trying to hurt myself.
However, after that kind of disk failure, I really did believe it
likely that my entire array was trashed, so I was too quick to
assume I knew why the box was fighting me, and finally forced it
to complete the reinitialization. It wasn't until I saw the 1.5TB
"drive" in dmesg that I realized what had happened. Operator
failure (yes, your designs had better take that into account!).
The other (then) three arrays were absolutely fine and the one
that blew/was blown was only about one quarter full, so the
restoration was unpleasant, but beat the heck out of restoring
the entire system.
You can look at this machine and point out that among its 14
drives (four 3-disk RAID5 arrays, one 2-disk mirror for the OS),
five disks are devoted to redundancy, and claim my design is
inefficient. I'll tell you it is good. :)  It fits the
requirement well, it was economical, it is maintainable, it
Just Works.
Before working on this project, I thought 8G hard disks were
pretty freaking cool. Now my perspective is skewed: I look
at a pile of 36G SCA disks and think, "what tiny drives!",
and 300G drives are just not that exciting to me anymore.
However, as this system reminds me once in a while, terabytes
(even gigabytes!) of storage are still huge now. A "generic"
solution to all questions regarding that much data is just
not reasonable to expect at the moment (in ten years,
maybe...but then, I don't think we've ever come up with /the/
way to manage 40G of data, so maybe not), so spend a little
time thinking /your/ needs through carefully before deciding
on a solution.
Nick.