We're running jdk1.6.0_26 at this time, and below is what we're calling the jobs with. If by "spawning the map/reduce slots" you mean the arguments our internal classes use, I'd have to go back to the dev team and do some more digging.
$HADOOP_HOME/bin/hadoop jar ${JAR_FILE} com.our.private.class ${D_OPTS} \
  '-Dmapred.child.java.opts=-Xmx2g -server -Xss128k' \
  -Dmapred.reduce.tasks=${REDUCER_COUNT} -libjars ${LIB_JARS} \
  ${SOURCE} ${OUTPUT_DIR} ${HDFS_PREFIX} ${HBASE_PREFIX}
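For what it's worth, here's a quick way to see the actual arguments the child JVMs get launched with on a worker node (just a sketch; it assumes the tasks run as the 'mapred' user, as in a stock CDH3 install):

  # List running child task JVMs with their full command lines;
  # in 0.20 the child's main class is org.apache.hadoop.mapred.Child.
  # The '[C]hild' pattern keeps grep from matching itself.
  ps -u mapred -o pid,args | grep '[C]hild'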
I don't recall seeing the 100% CPU issue with the child JVMs when this
happens. Unfortunately I'm also not able to replicate this again until
after the holidays (it's our busy season and we're in a holding
pattern).
John Miller | Sr. Linux Systems Administrator
530 E. Liberty St.
Ann Arbor, MI 48104
Direct: 734.922.7007
http://mybuys.com/
From: Todd Lipcon [mailto:[email protected]]
Sent: Monday, December 19, 2011 11:09 PM
To: [email protected]
Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)
On Mon, Dec 19, 2011 at 7:29 PM, rajesh balamohan <[email protected]> wrote:
Hi John,
Which version of the JVM are you using (JDK 1.6.0_2x?), and what
JVM arguments do you use for spawning the map/reduce slots?
Check whether the JVM is stuck on the machine. Sometimes I have seen
a task JVM, just after launching, get into a spinning mode and occupy 100% CPU.
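A quick way to spot that state on a node (a rough sketch; again, '[C]hild' just keeps grep from matching itself):

  # Sort processes by CPU; a spinning task JVM will sit at ~100% indefinitely
  ps -eo pid,pcpu,etime,args --sort=-pcpu | grep '[C]hild' | head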
Yep, this one that Rajesh mentions is a RHEL 6 bug:
https://bugzilla.redhat.com/show_bug.cgi?id=750419
We can reproduce it in our RHEL 6 QA clusters pretty reliably, but we're
still working with Red Hat to reproduce/fix it.
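If you want to check whether your nodes are on an affected platform, something like this is enough (a sketch; compare the output against the kernel versions discussed in the bugzilla above):

  # Identify the distro release and running kernel on each TT node
  cat /etc/redhat-release
  uname -r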
Thanks
-Todd
On Fri, Dec 16, 2011 at 2:26 AM, John Miller <[email protected]> wrote:
Hello Arun,
Thanks for the quick reply. I totally understand the CDH issue,
but I figured I'd ask the broader community as well in case there was a
known upstream issue, as I've noticed some patches relating to "somewhat
similar" issues.
A jstack was already on my radar, but I hadn't even thought
about using tcpdump to catch whether the tasks were heartbeating or not,
so thanks for the tip; I'll make sure to check that out! We are also
planning our release update from CDH 3u0 to 3u2, which takes us from our
current hadoop 0.20.2+923.21 to the updated 0.20.2+923.142 and may
incidentally fix the issue as well; if it does, I'll at least let
everyone here know.
If there are any further ideas, or if anyone else has experienced a
similar issue, my ears are open. Thanks again Arun! :)
John Miller | Sr. Linux Systems Administrator
530 E. Liberty St.
Ann Arbor, MI 48104
Direct: 734.922.7007
http://mybuys.com/
From: Arun C Murthy [mailto:[email protected]]
Sent: Thursday, December 15, 2011 2:03 PM
To: [email protected]
Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)
Hi John,
It's hard for folks on this list to diagnose CDH (you might
have to ask their lists). However, I haven't seen similar issues with
hadoop-0.20.2xx in a while.
One thing to check would be to grab a stack trace (jstack) on
the tasks to see what they are up to. Next, try to get a tcpdump to see
if the tasks are indeed sending heartbeats to the TT, which might be the
reason the TTs aren't timing them out.
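Roughly something like the following (a sketch only; <pid> and <port> are placeholders -- in 0.20 the child heartbeats to the TT over loopback, on the port from mapred.task.tracker.report.address, which is ephemeral by default):

  # Dump the stuck task's threads (run as the user that owns the task JVM)
  jstack <pid> > /tmp/task-<pid>.jstack
  # Watch for heartbeat traffic between the child JVM and the TT
  tcpdump -i lo -nn port <port>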
hth,
Arun
On Dec 15, 2011, at 7:58 AM, John Miller wrote:
I've recently come across some interesting behavior within a
50-node cluster regarding the tasktrackers and task attempts.
Essentially, tasks are being created but they stick at 0.0%; it seems
the 'mapreduce.task.timeout' isn't taking effect, so they just sit there
(for days if we let them) and the jobs have to get killed. It's
interesting to note that the HDFS datanode service and HBase
regionserver running on these nodes work fine, and we've simply been
shutting down the tasktracker service on them in order to get around
jobs stalling forever.
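For completeness, our day-to-day workaround looks roughly like this (the job and attempt ids below are placeholders):

  # Find the wedged job and confirm its maps are stuck at 0.0%
  hadoop job -list
  hadoop job -status <job_id>
  # Kill the whole job, or just the stuck attempt if you have its id
  hadoop job -kill <job_id>
  hadoop job -kill-task <task_attempt_id>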
Some historical information: we're running Cloudera's cdh3u0
release, and this has so far only happened on a handful of random
tasktracker nodes. It seems to have affected only nodes that were
taken down for maintenance and then brought back into the cluster (or,
in one case, a node that was brought into the cluster after it had been
running for a while; we ran into the same issue there). After re-adding
the nodes to the cluster, the tasktracker service starts getting these
stalls. Also note that this has not happened to every node that was
taken out of service for a time and then re-added; I'd say about a
third of them have run into this issue after maintenance. The
particular maintenance issues on the affected nodes were NOT the same
(e.g. one was bad RAM, another a bad sector on a disk); it was never
the same initial problem, only the same outcome after rejoining the
cluster.
It's also never the same mapred job that sticks, nor is there
any evidence tying the stalls to a specific time of day. Rather, a
node will run fine for many jobs and then, all of a sudden, some tasks
will stall and stick at 0.0%. There are no visible errors in the log
outputs, yet nothing moves forward, and the stalled job won't release
its mappers for any other jobs to use until it is killed. It seems the
default 'mapreduce.task.timeout' just isn't working for some reason.
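For reference, here's how we're double-checking the effective timeout on a TT node (a rough sketch; the path assumes a stock layout, and note that 0.20-based releases read the older property name):

  # In 0.20.x/CDH3 the property is 'mapred.task.timeout' (milliseconds,
  # default 600000); 'mapreduce.task.timeout' is the post-0.21 name.
  grep -A1 'task.timeout' $HADOOP_HOME/conf/mapred-site.xml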
Has anyone come across anything similar to this? I can provide
more details/data as needed.
John Miller | Sr. Linux Systems Administrator
530 E. Liberty St.
Ann Arbor, MI 48104
Direct: 734.922.7007
http://mybuys.com/
--
Todd Lipcon
Software Engineer, Cloudera
