Re: High load on datanode startup

Darrell Taylor Fri, 11 May 2012 02:30:26 -0700

On Thu, May 10, 2012 at 5:58 PM, Raj Vishwanathan <[email protected]> wrote:


> Darrell
>
> Are the new dn, nn and mapred directories on the same physical disk?
> Nothing on NFS, correct?
>

Yes, that's correct


>
> Could you be having some hardware issue? Any clue in /var/log/messages or
> dmesg?
>

Hardware is good, all logs are clean.


>
> A non responsive system indicates a CPU that is really busy either doing
> something or waiting for something and the fact that it happens only on
> some nodes indicates a local problem.
>

Yes, it was a very strange problem, which I seem to have solved (for
now).  Yesterday I upgraded the cluster to cdh4 and found that some of
the nodes started to display similar behaviour, but I was able to catch them
early enough to do something about it.  The solution was to remove the
hadoop-env.sh that I had copied over from the cdh3 install.  The only thing
I had added to this file was the following, which I did to get pig/hbase
talking:

export HADOOP_CLASSPATH="`/usr/bin/hbase classpath`:$HADOOP_CLASSPATH"

What I saw on the machine was thousands of recursive processes in ps of the
form 'bash /usr/bin/hbase classpath...'.  Stopping everything didn't clean
the processes up, so I had to kill them manually with some grep/xargs foo.
Once this was all cleaned up and the hadoop-env.sh file removed, the nodes
seem to be happy again.
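
For anyone who wants to keep the classpath line, here is a minimal sketch of a
guard for hadoop-env.sh.  It assumes the runaway processes came from
/usr/bin/hbase starting a child that sources hadoop-env.sh again, which is a
guess based on the ps output above and not something confirmed on this
cluster.  HBASE_CLASSPATH_RESOLVED is a made-up variable name, not a Hadoop or
HBase setting.

```shell
# Sketch of a re-entrancy guard for hadoop-env.sh (assumption: the file is
# being sourced again by a child of "hbase classpath").
# HBASE_CLASSPATH_RESOLVED is a hypothetical name chosen for illustration.
if [ -z "${HBASE_CLASSPATH_RESOLVED:-}" ]; then
  # Export the guard first, so any child process that sources this file
  # again inherits it and skips this block instead of recursing.
  export HBASE_CLASSPATH_RESOLVED=1
  # Only call hbase if the launcher is actually present and executable.
  if [ -x /usr/bin/hbase ]; then
    HADOOP_CLASSPATH="$(/usr/bin/hbase classpath):${HADOOP_CLASSPATH:-}"
    export HADOOP_CLASSPATH
  fi
fi
```

The guard is exported before the hbase call on purpose: setting it afterwards
would be too late to stop a child from re-entering the block.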

Darrell.


>
> Raj
>
>
>
> >________________________________
> > From: Darrell Taylor <[email protected]>
> >To: [email protected]
> >Cc: Raj Vishwanathan <[email protected]>
> >Sent: Thursday, May 10, 2012 3:57 AM
> >Subject: Re: High load on datanode startup
> >
> >On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon <[email protected]> wrote:
> >
> >> That's real weird..
> >>
> >> If you can reproduce this after a reboot, I'd recommend letting the DN
> >> run for a minute, and then capturing a "jstack <pid of dn>" as well as
> >> the output of "top -H -p <pid of dn> -b -n 5" and send it to the list.
> >
> >
> >What I did after the reboot this morning was to move my dn, nn, and
> >mapred directories out of the way, create a new one, format it, and
> >restart the node. It's now happy.
> >
> >I'll try moving the directories back later and do the jstack as you suggest.
> >
> >
> >>
> >> What JVM/JDK are you using? What OS version?
> >>
> >
> >root@pl446:/# dpkg --get-selections | grep java
> >java-common                                     install
> >libjaxp1.3-java                                 install
> >libjaxp1.3-java-gcj                             install
> >libmysql-java                                   install
> >libxerces2-java                                 install
> >libxerces2-java-gcj                             install
> >sun-java6-bin                                   install
> >sun-java6-javadb                                install
> >sun-java6-jdk                                   install
> >sun-java6-jre                                   install
> >
> >root@pl446:/# java -version
> >java version "1.6.0_26"
> >Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> >Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
> >
> >root@pl446:/# cat /etc/issue
> >Debian GNU/Linux 6.0 \n \l
> >
> >
> >
> >>
> >> -Todd
> >>
> >>
> >> On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor
> >> <[email protected]> wrote:
> >> > On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan <[email protected]>
> >> wrote:
> >> >
> >> >> The picture is either too small or too pixelated for my eyes :-)
> >> >>
> >> >
> >> > There should be a zoom option in the top right of the page that allows
> >> > you to view it full size.
> >> >
> >> >
> >> >>
> >> >> Can you log in to the box and send the output of top? If the system is
> >> >> unresponsive, it has to be something more than an unbalanced hdfs
> >> >> cluster, methinks.
> >> >>
> >> >
> >> > Sorry, I'm unable to log in to the box, it's completely unresponsive.
> >> >
> >> >
> >> >>
> >> >> Raj
> >> >>
> >> >>
> >> >>
> >> >> >________________________________
> >> >> > From: Darrell Taylor <[email protected]>
> >> >> >To: [email protected]; Raj Vishwanathan <
> [email protected]
> >> >
> >> >> >Sent: Wednesday, May 9, 2012 2:40 PM
> >> >> >Subject: Re: High load on datanode startup
> >> >> >
> >> >> >On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan <
> [email protected]>
> >> >> wrote:
> >> >> >
> >> >> >> When you say 'load', what do you mean? CPU load or something else?
> >> >> >>
> >> >> >
> >> >> >I mean in the unix sense of load average, i.e. top would show a load of
> >> >> >(currently) 376.
> >> >> >
> >> >> >Looking at Ganglia stats for the box, it's not CPU load as such: the graphs
> >> >> >show actual CPU usage as 30%, but the number of running processes is
> >> >> >simply growing in a linear manner. Screenshot of the Ganglia page here:
> >> >> >
> >> >> >
> >> >>
> >>
> https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Raj
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> >________________________________
> >> >> >> > From: Darrell Taylor <[email protected]>
> >> >> >> >To: [email protected]
> >> >> >> >Sent: Wednesday, May 9, 2012 9:52 AM
> >> >> >> >Subject: High load on datanode startup
> >> >> >> >
> >> >> >> >Hi,
> >> >> >> >
> >> >> >> >I wonder if someone could give some pointers with a problem I'm having?
> >> >> >> >
> >> >> >> >I have a 7 machine cluster set up for testing and we have been pouring data
> >> >> >> >into it for a week without issue. We have learnt several things along the way
> >> >> >> >and solved all the problems up to now by searching online, but now I'm
> >> >> >> >stuck.  One of the data nodes decided to have a load of 70+ this morning.
> >> >> >> >Stopping datanode and tasktracker brought it back to normal, but every time
> >> >> >> >I start the datanode again the load shoots through the roof, and all I get
> >> >> >> >in the logs is:
> >> >> >> >
> >> >> >> >STARTUP_MSG: Starting DataNode
> >> >> >> >
> >> >> >> >
> >> >> >> >STARTUP_MSG:   host = pl464/10.20.16.64
> >> >> >> >
> >> >> >> >
> >> >> >> >STARTUP_MSG:   args = []
> >> >> >> >
> >> >> >> >
> >> >> >> >STARTUP_MSG:   version = 0.20.2-cdh3u3
> >> >> >> >
> >> >> >> >
> >> >> >> >STARTUP_MSG:   build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
> >> >> >> >-************************************************************/
> >> >> >> >
> >> >> >> >
> >> >> >> >2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >> >> >> >
> >> >> >> >2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >> >> >> >
> >> >> >> >Nothing else.
> >> >> >> >
> >> >> >> >The load seems to max out only 1 of the CPUs, but the machine
> >> becomes
> >> >> >> >*very* unresponsive
> >> >> >> >
> >> >> >> >Anybody got any pointers of things I can try?
> >> >> >> >
> >> >> >> >Thanks
> >> >> >> >Darrell.
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >>
> >>
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
> >
> >
>
