On Thu, May 10, 2012 at 5:58 PM, Raj Vishwanathan <[email protected]> wrote:
> Darrell
>
> Are the new dn, nn and mapred directories on the same physical disk?
> Nothing on NFS, correct?
>

Yes, that's correct.

>
> Could you be having some hardware issue? Any clue in /var/log/messages or
> dmesg?
>

Hardware is good, all logs are clean.

>
> A non responsive system indicates a CPU that is really busy either doing
> something or waiting for something and the fact that it happens only on
> some nodes indicates a local problem.
>

Yes, it was a very strange problem, which I seem to have solved (for now).

So, yesterday I upgraded the cluster to cdh4, and I found some of the nodes
started to display similar behaviour, but I was able to catch them early
enough to do something about it. The solution was to remove the
hadoop-env.sh that I had copied over from the cdh3 install. The only thing I
had added to this file was the following, which I did to get pig/hbase
talking:

export HADOOP_CLASSPATH="`/usr/bin/hbase classpath`:$HADOOP_CLASSPATH"

What I saw on the machine was thousands of recursive processes in ps of the
form 'bash /usr/bin/hbase classpath...'. Stopping everything didn't clean the
processes up, so I had to kill them manually with some grep/xargs foo.

Once this was all cleaned up and the hadoop-env.sh file removed, the nodes
seem to be happy again.

Darrell.

>
> Raj
>
>
>
> >________________________________
> > From: Darrell Taylor <[email protected]>
> >To: [email protected]
> >Cc: Raj Vishwanathan <[email protected]>
> >Sent: Thursday, May 10, 2012 3:57 AM
> >Subject: Re: High load on datanode startup
> >
> >On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon <[email protected]> wrote:
> >
> >> That's real weird..
> >>
> >> If you can reproduce this after a reboot, I'd recommend letting the DN
> >> run for a minute, and then capturing a "jstack <pid of dn>" as well as
> >> the output of "top -H -p <pid of dn> -b -n 5" and send it to the list.
> >
> >
> >What I did after the reboot this morning was to move my dn, nn, and
> >mapred directories out of the way, create new ones, format them, and
> >restart the node. It's now happy.
> >
> >I'll try moving the directories back later and do the jstack as you
> >suggest.
> >
> >
> >>
> >> What JVM/JDK are you using? What OS version?
> >>
> >
> >root@pl446:/# dpkg --get-selections | grep java
> >java-common install
> >libjaxp1.3-java install
> >libjaxp1.3-java-gcj install
> >libmysql-java install
> >libxerces2-java install
> >libxerces2-java-gcj install
> >sun-java6-bin install
> >sun-java6-javadb install
> >sun-java6-jdk install
> >sun-java6-jre install
> >
> >root@pl446:/# java -version
> >java version "1.6.0_26"
> >Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> >Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
> >
> >root@pl446:/# cat /etc/issue
> >Debian GNU/Linux 6.0 \n \l
> >
> >
> >
> >>
> >> -Todd
> >>
> >>
> >> On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor
> >> <[email protected]> wrote:
> >> > On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan <[email protected]>
> >> > wrote:
> >> >
> >> >> The picture is either too small or too pixelated for my eyes :-)
> >> >>
> >> >
> >> > There should be a zoom option in the top right of the page that allows
> >> > you to view it full size
> >> >
> >> >
> >> >>
> >> >> Can you login to the box and send the output of top? If the system is
> >> >> unresponsive, it has to be something more than an unbalanced hdfs
> >> >> cluster, methinks.
> >> >>
> >> >
> >> > Sorry, I'm unable to login to the box, it's completely unresponsive.
> >> >
> >> >
> >> >>
> >> >> Raj
> >> >>
> >> >>
> >> >>
> >> >> >________________________________
> >> >> > From: Darrell Taylor <[email protected]>
> >> >> >To: [email protected]; Raj Vishwanathan <[email protected]>
> >> >> >Sent: Wednesday, May 9, 2012 2:40 PM
> >> >> >Subject: Re: High load on datanode startup
> >> >> >
> >> >> >On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan <[email protected]>
> >> >> >wrote:
> >> >> >
> >> >> >> When you say 'load', what do you mean? CPU load or something else?
> >> >> >>
> >> >> >
> >> >> >I mean in the unix sense of load average, i.e. top would show a load of
> >> >> >(currently) 376.
> >> >> >
> >> >> >Looking at Ganglia stats for the box it's not CPU load as such, the
> >> >> >graphs show actual CPU usage as 30%, but the number of running
> >> >> >processes is simply growing in a linear manner - screen shot of ganglia
> >> >> >page here :
> >> >> >
> >> >> >https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Raj
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> >________________________________
> >> >> >> > From: Darrell Taylor <[email protected]>
> >> >> >> >To: [email protected]
> >> >> >> >Sent: Wednesday, May 9, 2012 9:52 AM
> >> >> >> >Subject: High load on datanode startup
> >> >> >> >
> >> >> >> >Hi,
> >> >> >> >
> >> >> >> >I wonder if someone could give some pointers with a problem I'm
> >> >> >> >having?
> >> >> >> >
> >> >> >> >I have a 7 machine cluster setup for testing and we have been
> >> >> >> >pouring data into it for a week without issue, have learnt several
> >> >> >> >things along the way and solved all the problems up to now by
> >> >> >> >searching online, but now I'm stuck.
> >> >> >> >One of the data nodes decided to have a load of 70+ this morning,
> >> >> >> >stopping datanode and tasktracker brought it back to normal, but
> >> >> >> >every time I start the datanode again the load shoots through the
> >> >> >> >roof, and all I get in the logs is :
> >> >> >> >
> >> >> >> >STARTUP_MSG: Starting DataNode
> >> >> >> >STARTUP_MSG: host = pl464/10.20.16.64
> >> >> >> >STARTUP_MSG: args = []
> >> >> >> >STARTUP_MSG: version = 0.20.2-cdh3u3
> >> >> >> >STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
> >> >> >> >-************************************************************/
> >> >> >> >
> >> >> >> >2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >> >> >> >
> >> >> >> >2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >> >> >> >
> >> >> >> >Nothing else.
> >> >> >> >
> >> >> >> >The load seems to max out only 1 of the CPUs, but the machine
> >> >> >> >becomes *very* unresponsive
> >> >> >> >
> >> >> >> >Anybody got any pointers of things I can try?
> >> >> >> >
> >> >> >> >Thanks
> >> >> >> >Darrell.
> >> >> >> >
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
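
The thousands of recursive 'bash /usr/bin/hbase classpath' processes
reported at the top of this thread would be consistent with the hbase
launcher itself ending up sourcing hadoop-env.sh, so that every
'hbase classpath' call spawns another one. That cause is an assumption; the
thread does not confirm it. Under that assumption, a minimal sketch of a
guard that lets the classpath lookup run only once per process tree might
look like the following. The function name add_hbase_classpath and the
variable HBASE_CP_GUARD are hypothetical and not part of Hadoop or HBase.

```shell
# Sketch of a recursion guard for hadoop-env.sh (hypothetical names).
# Assumption: the hbase launcher may source this file again when it runs.
add_hbase_classpath() {
  # Path to the hbase launcher; defaults to the path used in the thread.
  hbase_bin="${1:-/usr/bin/hbase}"

  # The guard is exported, so child processes inherit it and skip the lookup.
  if [ -n "${HBASE_CP_GUARD:-}" ]; then
    return 0
  fi

  # Do nothing on nodes where the launcher is not installed.
  if [ ! -x "$hbase_bin" ]; then
    return 0
  fi

  # Set the guard BEFORE calling hbase, so a nested source of this file
  # returns early instead of spawning another 'hbase classpath'.
  export HBASE_CP_GUARD=1
  HADOOP_CLASSPATH="$("$hbase_bin" classpath):${HADOOP_CLASSPATH:-}"
  export HADOOP_CLASSPATH
}

add_hbase_classpath
```

Setting the guard before the command substitution is the point of the
sketch: if the order were reversed, the nested invocation would still see an
empty guard and recurse. Whether this is sufficient on a given cdh4 install
is untested here.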
