On Mon, Dec 19, 2011 at 7:29 PM, rajesh balamohan <[email protected]> wrote:
> Hi John,
>
> Which version of JVM are you using (JDK 1.6.0.2xx?), and what JVM
> arguments do you use for spawning the map/reduce slots?
>
> Check if the JVM is stuck on the machine. Sometimes I have seen a task
> JVM just launch, get into spinning mode and occupy 100% CPU.
>

Yep, this one that Rajesh mentions is a RHEL 6 bug:
https://bugzilla.redhat.com/show_bug.cgi?id=750419

We can reproduce it in our RHEL6 QA clusters pretty reliably, but we are
still working with Red Hat to reproduce/fix.

Thanks
-Todd

>
> On Fri, Dec 16, 2011 at 2:26 AM, John Miller <[email protected]> wrote:
>
>> Hello Arun,
>>
>> Thanks for the quick reply. I totally understand the CDH issue but
>> figured I’d ask the broader community as well in case there was any
>> known upstream issue, as I’ve noticed some patches relating to
>> “somewhat similar” issues.
>>
>> The jstack was already on my radar, but I hadn’t even thought about
>> tcpdump to catch whether the tasks were heartbeating or not, so thanks
>> for the tip; I will make sure to check that out! We are also planning
>> our release update to CDH 3u2 from 3u0, which will give us the updated
>> hadoop 0.20.2+923.142 vs. our current 0.20.2+923.21. That may
>> inadvertently fix the issue as well, in which case I’ll at least let
>> everyone here know if it does.
>>
>> If anyone has further ideas or has experienced a similar issue, my
>> ears are open. Thanks again Arun! :)
>>
>> John Miller | Sr. Linux Systems Administrator
>> 530 E. Liberty St.
>> Ann Arbor, MI 48104
>> Direct: 734.922.7007
>> http://mybuys.com/
>>
>> From: Arun C Murthy [mailto:[email protected]]
>> Sent: Thursday, December 15, 2011 2:03 PM
>> To: [email protected]
>> Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout
>> not working)
>>
>> Hi John,
>>
>> It's hard for folks on this list to diagnose CDH (you might have to
>> ask their lists). However, I haven't seen similar issues with
>> hadoop-0.20.2xx in a while.
>>
>> One thing to check would be to grab a stack trace (jstack) on the
>> tasks to see what they are up to. Next, try to get a tcpdump to see if
>> the tasks are indeed sending heartbeats to the TT, which might be the
>> reason the TTs aren't timing them out.
>>
>> hth,
>> Arun
>>
>> On Dec 15, 2011, at 7:58 AM, John Miller wrote:
>>
>> I’ve recently come across some interesting things happening within a
>> 50-node cluster regarding the tasktrackers and task attempts.
>> Essentially, tasks are being created but they are sticking at 0.0%,
>> and it seems the ‘mapreduce.task.timeout’ isn’t taking effect: they
>> just sit there (for days if we let them) and the jobs have to be
>> killed. It’s interesting to note that the HDFS datanode service and
>> HBase regionserver running on these nodes work fine, and we’ve simply
>> been shutting down the tasktracker service on them in order to get
>> around jobs stalling forever.
>>
>> Some historical information… We’re running Cloudera’s cdh3u0 release,
>> and this has so far only happened on a handful of random tasktracker
>> nodes. It seems to have only affected those that have been taken down
>> for maintenance and then brought back into the cluster; alternatively,
>> one node was brought into the cluster after the cluster had been
>> running for a while and we ran into the same issue. After re-adding
>> the nodes to the cluster, the tasktracker service starts getting these
>> stalls. Also note that this has not happened to every node that has
>> been taken out of service for a time and then re-added… I would say
>> about a third of them or so have run into this issue after
>> maintenance. The particular maintenance issues on the affected nodes
>> were NOT the same, i.e. one was bad RAM, another was a bad sector on a
>> disk, etc… never the same initial problem, only the same outcome after
>> rejoining the cluster.
>>
>> It’s also never the same mapred job that sticks, nor is there any
>> evidence relating the stalls to a specific time of day. Rather, the
>> node will run fine for many jobs and then all of a sudden some tasks
>> will stall and stick at 0.0%. There are no visible errors in the log
>> output, although nothing will move forward, nor will it release the
>> mappers for any other jobs to use until the stalled job is killed. It
>> seems that the default ‘mapreduce.task.timeout’ just isn’t working for
>> some reason.
>>
>> Has anyone come across anything similar to this? I can provide more
>> details/data as needed.
>>
>> John Miller | Sr. Linux Systems Administrator
>>
>
>

--
Todd Lipcon
Software Engineer, Cloudera
