On Thu, Aug 23, 2007 at 07:56:15AM -0400, Joe Landman wrote:
> Greetings Jakob:
Hi Joe,
Thanks for answering!
...
> up front disclaimer: we design/build/market/support such things.
That does not disqualify you :)
>> I'm looking at getting some big storage. Of all the parameters, getting as
>> low dollars/(month*GB) is by far the most important. The price of acquiring
>> and maintaining the storage solution is the number one concern.
>
> Should I presume density, reliability, and performance also factor in
> somewhere as 2, 3, 4 (somehow) on the concern list?
I expect that the major components of the total cost of running this beast
will be something like

  acquisition
  + power
  + cooling
  + payroll (disk-replacing admins :)

Real-estate is a concern as well, of course. The rent isn't free. It would be
nice to pack this in as few racks as possible.

Reliability, well... I expect frequent drive failures, and I would expect that
we'd run some form of RAID to mitigate this. If the rest of the hardware is
just reasonably well designed, the most frequently failing components should
be redundant and hot-swap replaceable (fans and PSUs).
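
Just to put rough numbers on the cost side, the back-of-envelope I have in
mind looks something like the sketch below (every figure in it is a
placeholder I made up for illustration, not a real quote):

# Rough $/GiB/month sketch for one storage building block. All numbers are
# made-up assumptions; substitute real quotes before trusting the output.

def dollars_per_gib_month(
    usable_tib=100.0,         # usable capacity after RAID overhead (assumed)
    acquisition_usd=150_000,  # purchase price of the hardware (assumed)
    lifetime_months=36,       # amortisation period (assumed)
    power_kw=5.0,             # average draw incl. cooling overhead (assumed)
    usd_per_kwh=0.15,         # electricity price (assumed)
    rack_units=12,            # space used (assumed)
    usd_per_u_month=25.0,     # rent per rack unit per month (assumed)
    admin_hours_month=10.0,   # disk swapping, monitoring, etc. (assumed)
    usd_per_admin_hour=60.0,  # loaded payroll cost (assumed)
):
    hours_per_month = 730
    acquisition = acquisition_usd / lifetime_months
    power_and_cooling = power_kw * hours_per_month * usd_per_kwh
    rent = rack_units * usd_per_u_month
    payroll = admin_hours_month * usd_per_admin_hour
    monthly_total = acquisition + power_and_cooling + rent + payroll
    return monthly_total / (usable_tib * 1024)   # -> dollars per GiB per month

if __name__ == "__main__":
    print(f"~${dollars_per_gib_month():.3f} per GiB per month")

The exact figures matter far less to me than being able to compare candidate
setups on the same formula.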
It's acceptable that a head node fails for a short period of time. The entire
system will not depend on all head nodes functioning simultaneously.
>> The setup will probably have a number of "head nodes" which receive a large
>> amount of data over standard gigabit from a large number of remote sources.
>> Data is read infrequently from the head nodes by remote systems. The primary
>> load on the system will be data writes.
>
> Ok, so you are write dominated. Could you describe (guesses are fine) what
> the writes will look like? Large sequential data, small random data (seek,
> write, close)?
I would expect something like 100-1000 simultaneous streaming writes to just
as many files (one file per writer). The files will be everything from a few
hundred MiB to many GiB.

I guess that on most filesystems these streaming sequential writes will result
in something close to "random writes" at the block layer. However, we can be
very generous with write buffering.
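
If it helps to picture the load, the toy sketch below is roughly what I mean
by one sequential stream per writer (the writer count, sizes and target
directory are made up and scaled way down):

# Toy model of the expected load: many concurrent sequential writers, one
# file per writer. Everything is scaled down; in production there would be
# 100-1000 writers and files of hundreds of MiB to many GiB.
import os
import threading

TARGET_DIR = "/tmp/write-test"   # stand-in for the real storage mount
N_WRITERS = 100
CHUNK = 1 << 20                  # 1 MiB appended at a time
CHUNKS_PER_FILE = 16             # i.e. only 16 MiB per file in this toy run

def writer(idx: int) -> None:
    path = os.path.join(TARGET_DIR, f"stream-{idx:04d}.dat")
    with open(path, "wb") as f:
        for _ in range(CHUNKS_PER_FILE):
            f.write(b"\0" * CHUNK)   # purely sequential appends within one file

if __name__ == "__main__":
    os.makedirs(TARGET_DIR, exist_ok=True)
    threads = [threading.Thread(target=writer, args=(i,)) for i in range(N_WRITERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()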
>> The head nodes need not see the same unified storage; so I am not required
>> to have one big shared filesystem. If beneficial, each of the head nodes
>> could have their own local storage.
>
> There are some interesting designs with a variety of systems, including
> GFS/Lustre/... on those head nodes, and a big pool of drives behind them.
> These designs will add to the overall cost, and increase complexity.
Simple is nice :)
>> The storage pool will start out at around 100TiB and will grow to ~1PiB
>> within a year or two (too early to tell). It would be nice to use as few
>> racks as possible, and as little power as possible :)
>
> Ok, so density and power are important. This is good. Coupled with the low
> management cost and low acquisition cost, we have about 3/4 of what we need.
> Just need a little more description of the writes.
I hope the above helped.
> Also, do you intend to back this up?
That is a *very* good question.
> How important is resiliency of the system? Can you tolerate a failed unit
> (assume the units have hot spares, RAID6, etc)?
Yes. Single head nodes may fail. They must be fairly quick to get back on
line (having a replacement box I would expect no more than an hour of
downtime).
> When you look at storage of this size, you have to start planning for the
> eventual (and likely) failure of a chassis (or some number of them), and
> think about a RAIN configuration.
Yep. I don't know how likely a "many-disk" failure would be... If I have a
full replacement chassis, I would guess that I could simply pull out all the
disks from a failed system, move them to the replacement chassis and be up
and running again in "short" time.

If a PSU decides to fry everything connected to it including the disks, then
yes, I can see the point in RAIN or a full backup.
It's a business decision whether a full node loss would be acceptable. I
honestly don't know that, but it is definitely interesting to consider both
"yes" and "no".
> Either that, or invest in massive low-level redundancy (which should be
> scope limited to the box it is on anyway).
Yes; I had something like RAID-5 or so in mind on the nodes.
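
Roughly, the capacity cost of that redundancy looks like the sketch below for
a 48-bay box (the kind of chassis I mention further down); the group size and
parity layout are just assumptions for illustration, not anybody's actual
configuration:

# Usable capacity left after parity on one 48-bay chassis. Group size,
# parity count and drive size below are illustrative assumptions.
def usable_tb(drives=48, drive_tb=1.0, group_size=12, parity_per_group=1, spares=0):
    groups = (drives - spares) // group_size
    return groups * (group_size - parity_per_group) * drive_tb

if __name__ == "__main__":
    print("RAID-5 style:", usable_tb(parity_per_group=1), "TB usable")
    print("RAID-6 style:", usable_tb(parity_per_group=2), "TB usable")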
>> It *might* be possible to offload older files to tape; does anyone have
>> experience with HSM on Linux? Does it work? Could it be worthwhile to
>> investigate?
>
> Hmmm... First I would suggest avoiding tape; you should likely be looking at
> disk-to-disk for backup, using slower nearline mechanisms.
Why would you avoid tape?

Let's say there was software which allowed me to offload data to tape in a
reasonable manner. Considering the running costs of disk versus tape, tape
would win hands down on power, cooling and replacements.

Sure, the random seek time of a tape library sucks golf balls through a garden
hose, but assuming that one could live with that, are there more important
reasons to avoid tape?
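
To show the kind of comparison I mean, here is a crude sketch of yearly
running costs for a petabyte of cold data on spinning disk versus tape. Every
per-unit figure is an assumption pulled out of thin air, so only the shape of
the comparison means anything:

# Crude running-cost comparison: ~1 PB of rarely-read data on disk vs. tape.
# All per-unit figures are illustrative assumptions, not vendor numbers.
TB_TOTAL = 1000.0          # ~1 PB, decimal

def disk_cost_per_year(tb=TB_TOTAL, w_per_tb=10.0, usd_per_kwh=0.15,
                       drive_tb=1.0, afr=0.03, usd_per_drive=250.0):
    power = tb * w_per_tb / 1000.0 * 8760 * usd_per_kwh     # spinning 24x7
    replacements = (tb / drive_tb) * afr * usd_per_drive    # failed drives per year
    return power + replacements

def tape_cost_per_year(tb=TB_TOTAL, usd_per_tb_media=20.0, media_life_years=5,
                       library_power_w=500.0, usd_per_kwh=0.15):
    media = tb * usd_per_tb_media / media_life_years        # amortised cartridges
    power = library_power_w / 1000.0 * 8760 * usd_per_kwh   # idle cartridges draw nothing
    return media + power

if __name__ == "__main__":
    print(f"disk: ~${disk_cost_per_year():,.0f} per year")
    print(f"tape: ~${tape_cost_per_year():,.0f} per year")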
>> One setup I was looking at is simply using SunFire X4500 systems (you can
>> put 48 standard 3.5" SATA drives in each 4U system). Assuming I can buy them
>> with 1TB SATA drives shortly, I could start out with 3 systems (12U) and
>> grow the entire setup to 1P with 22 systems in little over two full racks.
>>
>> Any better ideas? Is there a way to get this more dense without paying an
>> arm and a leg? Has anyone tried something like this with HSM?
>
> Yes, but I don't want to turn this into a commercial, so I will be succinct.
> Scalable Informatics (my company) has a similar product, which does have a
> good price and price per gigabyte, while providing excellent performance.
> Details (white paper, benchmarks, presentations) at the
> http://jackrabbit.scalableinformatics.com web site.
Yep, I was just looking at that, actually.

The hardware looks similar in concept to the SunFire, but as I see it you guys
have thought about a number of services on top of that (RAIN etc.).

Very interesting!
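
For reference, the X4500-style density numbers I quoted above work out roughly
like the sketch below (assuming 42U racks and raw, pre-RAID decimal terabytes):

# Quick check of the density claim: 4U boxes holding 48 x 1 TB drives each.
DRIVES_PER_BOX = 48
TB_PER_DRIVE = 1.0      # raw decimal TB, before RAID overhead
U_PER_BOX = 4
U_PER_RACK = 42         # assumed rack height

def boxes_for(target_tb):
    return int(-(-target_tb // (DRIVES_PER_BOX * TB_PER_DRIVE)))  # ceiling division

for target_tb in (100, 1000):        # ~100 TB to start, ~1 PB eventually
    n = boxes_for(target_tb)
    u = n * U_PER_BOX
    print(f"{target_tb:5d} TB raw -> {n:2d} boxes, {u:3d}U (~{u / U_PER_RACK:.1f} racks)")

That is raw capacity, of course; parity and hot spares eat into it, which is
why I would budget a box or two beyond the bare minimum (hence the 22 above).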
--
/ jakob
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf