On 04/07/2014 04:40 PM, Ellis H. Wilson III wrote:
On 04/07/2014 03:44 PM, Prentice Bisbal wrote:
As long as you use enterprise-grade SSDs (e.g., Intel's stuff) with
overprovisioning, the NAND endurance shouldn't be an issue over the
lifetime of a cluster. We've used SSDs as our nodes' system disks for
a few years now (going on four with our oldest, 324-node production
system), and there have been no major problems. The problems we did
have all came when we were using cheaper commodity SSDs. Don't give in
to the temptation to save a few pennies there.
Thanks for the info. Enterprise typically means MLC instead of SLC,
right?
After reading the link below, I think I got my concept of SLC and MLC
backwards. Sorry; I'm not an SSD expert by any stretch of the imagination.
There is a lot of cruft to filter through in the SSD space to
understand what the hell is really going on. First of all,
"enterprise" can really mean anything, but one thing is for certain:
it is more expensive. Enterprise can mean the same material (flash
cells) but a different (better) flash translation layer
(wear-leveling/garbage collection/etc) or a different feature size
(bigger generally means more reliable and less dense), or some fancier
fab tech. There's a blog with more info on the topic here:
http://www.violin-memory.com/blog/flash-flavors-mlc-emlc-or-vmlc-part-i/
Thanks. This was an excellent read.
Either way, "more bits-per-cell" are generally seen as less enterprise
than "fewer bits-per-cell." So, SLC is high-reliability enterprise,
MLC can be enterprise in some cases (marketing has even taken it upon
itself to brand some "eMLC" or enterprise MLC, which has about as much
meaning as can be expected) and TLC is arguably just commodity. Fewer
bits per cell also means lower latencies, particularly for writes/erases.
I guess I disagree with the previous poster that going the commodity
route (which, by the way, saves not pennies but often upwards of 50%)
is always bad. It really depends on your situation/use-case. I
wouldn't store permanent data on outright commodity SSDs, but as a
LOCAL scratch-pad they can be brilliant (and replacing them later may
be far more advisable than spending a ton up front and praying they
never wear out).
For instance, since you mention Hadoop, you are in a good situation to
consider commodity SSDs, since Hadoop will automatically fail over to
another node if one node's SSD dies. It's not going to kill off your
whole job; Hadoop is built to cope with that. That being said,
I am not suggesting you necessarily should go the route of putting
HDFS on SSDs. The bandwidth and capacity concerns you raise are spot
on there. What I am suggesting is perhaps using a few commodity SSDs
for your mapreduce.cluster.local.dir, or where your intermediate data
will reside. You suggest "not many users will take advantage of
this." If your core application is Hadoop, every user will take
advantage of these SSDs (unless they explicitly override the tmp path,
which is possible), and the gains can be significant over HDDs.
Moreover, you aren't multiplexing persistent and temporary data
onto/from your HDFS HDDs, so you can see speedups on persistent data
as well, since you've effectively created dedicated storage pools for
the two types of access. This can be important.
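For concreteness, pointing intermediate data at the SSDs is just a
local-dir setting in mapred-site.xml. A minimal sketch, assuming the
SSDs are mounted at /ssd0 and /ssd1 (those mount points are made up;
adjust for your layout):

  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/ssd0/mapred/local,/ssd1/mapred/local</value>
  </property>

Listing multiple directories lets the framework spread spill files
across both drives. Depending on your Hadoop version, the older
mapred.local.dir name may be what applies instead.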
I don't know much about Hadoop, but this seems like a good
middle-ground. I like technology to be transparent to users. If they
have to do something specific on their end to improve performance,
nine times out of ten they won't, either because they're unaware the
feature is available, don't understand the impact of using it, or are
just too lazy to change their habits.
Caveat #1: Make sure to determine how much temporary space will be
needed, and acquire enough SSDs to cover that across the cluster.
That, or instruct your users that "jobs generating up to X TB of
intermediate data can run on the SSDs, which is done by default, but
for jobs exceeding that, use these extra parameters to send the tmp
data to HDDs." More complexity, though. Depends on the user-base.
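To put rough (and purely hypothetical) numbers on it: if your largest
jobs shuffle around 10 TB of intermediate data across 50 nodes, that
works out to roughly 200 GB per node at the peak, so a 400 GB SSD (or
a pair of 200 GB drives) per node leaves comfortable headroom, and
anything bigger falls back to the HDD path.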
That could be a problem, because I really don't know the user habits
yet. Since this will be my first Hadoop resource, I have no job-size
data to go from. Also, see my previous comment.
Caveat #2: If you're building incredibly stacked boxes (e.g., 16+
HDDs), you may be resource-limited in a number of ways that make
adding SATA SSDs unwise. It may not be worth the effort to squeeze
more drives in there, and PCIe SSDs (which tend to be more enterprise
anyhow) might be the way to go.
Caveat #3: Only certain types of Hadoop jobs really hammer
intermediate space. Read- and write-intensive jobs often won't, but
those special ones that do (e.g., Sort) benefit immensely from a fast
intermediate space.
If that's the case, then your suggestion may not be worth it at this
point, since I don't have any usage data to determine whether it's a
wise investment.
Caveat #4: There are probably more caveats. My advice is to build
two mock-up machines, with and without SSDs, and run a "baby cluster"
Hadoop instance. That way, if SSDs really don't bring the performance
gains you want, you avoid buying a bunch of them, wasting money, and
probably spending time replacing them down the road.
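If you do get test boxes, a quick way to stress the intermediate
space is the stock TeraGen/TeraSort examples that ship with Hadoop.
Roughly (the jar path and row count below are placeholders; match them
to your install and scratch capacity):

  hadoop jar hadoop-mapreduce-examples-*.jar teragen 1000000000 /bench/tg-in
  hadoop jar hadoop-mapreduce-examples-*.jar terasort /bench/tg-in /bench/ts-out

TeraGen writes 100-byte rows (so a billion rows is about 100 GB), and
TeraSort is exactly the kind of shuffle-heavy job from Caveat #3, so
comparing its runtime with the local dirs on SSD versus HDD should
tell you quickly whether the drives pay for themselves.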
This is the ideal approach. Unfortunately, I don't have the time or
resources for this. :(
More on Wear-Out: This is becoming an issue again for /modern/
feature sizes and commodity bit-levels (especially TLC). For drives a
generation or two back in feature size, fancy wear-leveling and
egregious amounts of over-provisioning have more or less made wear-out
impossible within the lifetime of your machine.
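If you do go commodity, it's also worth watching actual wear rather
than guessing. smartmontools will report the drive's own wear
counters, e.g. (/dev/sdX is a placeholder for the device in question):

  smartctl -A /dev/sdX

On Intel drives, look at the Media_Wearout_Indicator attribute (it
starts at 100 and counts down); other vendors expose similar counters
under different names (e.g., Wear_Leveling_Count), so check your
model's documentation.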
Best,
ellis