On Jul 12, 2011, at 4:34 PM, <[email protected]>
<[email protected]> wrote:
> I am working on deploying Hadoop on a small cluster. For now, I am interested
> in restarting (restarting the node or even rebooting the OS) the nodes Hadoop
> detects as crashed.
There are quite a few scenarios where one service may be up but another
may be down. So per-service is usually a better way to go.
> "Instead, one should monitor the namenode and jobtracker and alert based on a
> percentage of availability. ... "
> Indeed.
> I use Hadoop 0.20.203.
OK, then that means...
>
> "This can be done in a variety of ways, ..."
> Can you please provide any pointers.
... you're pretty much required to use JMX to query the NN and JT for
node information, since the rest of the APIs weren't forward-ported as
promised, and Ganglia is out of the equation anyway. Luckily, it is fairly
trivial to set up a Nagios script to poll that information (and in our
experience that information actually works; some of the metrics2 API, by
contrast, doesn't appear to be working properly on the DN and TT).
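On the Nagios side, the check itself is simple. Here's a minimal sketch in
Python of the "alert on a percentage of availability" idea -- the warning and
critical thresholds are arbitrary examples, and how you obtain the live/dead
counts (e.g. from your JMX poll of the NN/JT) is up to you:

```python
#!/usr/bin/env python
# Minimal Nagios-style availability check: given counts of live and dead
# nodes (obtained however you like, e.g. from a JMX poll of the NN/JT),
# alert on percentage availability. Thresholds are illustrative defaults.

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes

def check_availability(live, dead, warn=0.90, crit=0.75):
    """Map live/dead node counts to a Nagios exit code and status line."""
    total = live + dead
    if total == 0:
        return UNKNOWN, "UNKNOWN - no nodes reported"
    ratio = float(live) / total
    msg = "%d/%d nodes up (%.0f%%)" % (live, total, 100 * ratio)
    if ratio < crit:
        return CRITICAL, "CRITICAL - " + msg
    if ratio < warn:
        return WARNING, "WARNING - " + msg
    return OK, "OK - " + msg

if __name__ == "__main__":
    import sys
    code, message = check_availability(int(sys.argv[1]), int(sys.argv[2]))
    print(message)
    sys.exit(code)
```

Wire it into Nagios like any other plugin; the exit code drives the alert.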
> Do you know how I can access the monitoring information of the namenode or
> the jobtracker so I can extract a list of failed nodes?
Take a look at the DeadNodes and LiveNodes attributes on the NameNode
and JobTracker MBeans. That's likely your best bet.
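IIRC, those attributes come back as JSON strings keyed by hostname; the
per-node detail fields vary between versions, so treat the sample below as
illustrative and only rely on the keys. A parsing sketch:

```python
import json

def dead_node_names(dead_nodes_json):
    """Extract node names from a DeadNodes MBean attribute value.

    The attribute value is a JSON string whose keys are the dead nodes'
    hostnames. The per-node detail fields vary by Hadoop version, so we
    only rely on the keys here.
    """
    return sorted(json.loads(dead_nodes_json))

# Illustrative attribute value -- exact per-node fields are version-dependent.
sample = '{"node3.example.com": {"lastContact": 912}}'
```

From there, each returned name can be handed to whatever restart mechanism
you're using.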
>
> Why I thought of using metrics information is because they are periodic and
> seemed easy to access. I thought of using them as heartbeats only (i.e. if I
> do not receive the metric in 2-3 periods, I reset the node).
You end up essentially doing the same thing the NN and JT are already
doing... so you might as well just ask them rather than doing it again and
generating even more network traffic than necessary. Additionally, there are
some failures where the NN or JT may view a service daemon as down even though
it still responds to other queries (from thread death or lock-up). For
example, we've got a job that has on occasion tripped up the 0.20.2 DN with
OOM issues. The process lies in a pseudo-dead state due to some weird
exception handling down in the bowels of the code. The NN rightfully declares
it dead, but depending upon how you ask the node itself, it may respond!
So be careful out there.