On Jul 12, 2011, at 3:02 PM, <[email protected]>
<[email protected]> wrote:
> I am new to Hadoop, and I apologize if this was answered before, or if this
> is not the right list for my question.
common-user@ would likely have been better, but I'm too lazy to forward
you there today. :)
>
> I am trying to do the following:
> 1- Read monitoring information from slave nodes in hadoop
> 2- Process the data to detect node failures (node crashes, problems with
> requests, etc.) and decide if I need to restart the whole machine.
> 3- Restart the machine running the slave facing problems
At scale, one doesn't monitor individual nodes for up/down. Verifying
the up/down of a given node will drive you insane and is pretty much a waste of
time unless the grid itself is under-configured to the point that *every*
*node* *counts*. (If that is the case, then there are bigger issues afoot...)
Instead, one should monitor the namenode and jobtracker and alert based
on the percentage of nodes they report as unavailable. This can be done in a
variety of ways, depending upon which version of Hadoop is in play. For 0.20.2,
a simple screen scrape of their status pages is good enough. I recommend warn
at 10% of nodes down, alert at 20%, panic at 30%.
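To make the thresholds concrete, here is a rough Python sketch of the decision step only. It assumes you have already obtained live and dead node counts from the namenode or jobtracker by whatever means suits your version, and that the percentages refer to the share of nodes that are down. The names THRESHOLDS, percent_down and severity are made up for illustration and are not part of Hadoop.

```python
# Thresholds from the recommendation above, interpreted (as an assumption)
# as the percentage of nodes reported dead. Checked from most to least severe.
THRESHOLDS = [(30.0, "panic"), (20.0, "alert"), (10.0, "warn")]


def percent_down(live, dead):
    """Return the percentage of nodes that are down."""
    total = live + dead
    if total == 0:
        # No nodes reported at all: treat the grid as fully unavailable.
        return 100.0
    return 100.0 * dead / total


def severity(live, dead):
    """Map live/dead node counts to 'ok', 'warn', 'alert' or 'panic'."""
    pct = percent_down(live, dead)
    for limit, level in THRESHOLDS:
        if pct >= limit:
            return level
    return "ok"
```

You would call something like severity(live=188, dead=12) from a cron job or a monitoring check and page someone based on the returned level, rather than reacting to any single node going away.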
> My question is for step 1- collecting monitoring information.
> I have checked Hadoop monitoring features. But currently you can forward the
> monitoring data to files, or to Ganglia.
Do you want monitoring information or metrics information? Ganglia is
purely a metrics tool. Metrics are a different animal. While it is possible
to alert on them, in most cases they aren't particularly useful in a monitoring
context other than up/down.