RE: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)

John Miller Wed, 28 Dec 2011 13:37:33 -0800

Actually, we found the cause of the issue that occurred today.  A rogue
HBase regionserver was started and joined the cluster, which caused a
whole bunch of issues.  After we removed it, the cluster is working
properly.


 

Unfortunately, this is not the same issue we've been seeing in the past,
the one that originally prompted this email thread.  Rather, it arose
specifically today and was incorrectly tied to the issues we were seeing
previously, which will still need more investigation when they come up
again.

 

John Miller  |  Sr. Linux Systems Administrator

530 E. Liberty St.

Ann Arbor, MI 48104

Direct: 734.922.7007

http://mybuys.com/

 

From: John Miller 
Sent: Wednesday, December 28, 2011 2:09 PM
To: John Miller; [email protected]
Subject: RE: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not
working)

 

Here's a jstack from the jobtracker of an instance of this issue we hit
today.  Unfortunately, all of the tasks inside each mapper job show "no
task attempts found", and there are multiple jobs stuck at 89.49% and a
couple at 99.9% mapping.  Nothing has progressed for hours.  Any new
jobs submitted get stuck the same way as the others.  I was unable to
grab a tcpdump of the tasktracker heartbeats at this time, since there
are no child JVMs for them.

 


From: John Miller 
Sent: Thursday, December 15, 2011 3:56 PM
To: [email protected]
Subject: RE: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not
working)

 

Hello Arun,

 

Thanks for the quick reply.  I totally understand the CDH issue, but I
figured I'd ask the broader community as well in case there was a known
upstream issue, as I've noticed some patches relating to "somewhat
similar" issues.

 

The jstack was already on my radar, but I hadn't even thought about
using tcpdump to catch whether the tasks were heartbeating or not, so
thanks for the tip; I will make sure to check that out!  We are also
planning an upgrade from CDH 3u0 to 3u2, which will take us from our
current hadoop 0.20.2+923.21 to 0.20.2+923.142 and may inadvertently fix
the issue as well, in which case I'll at least let everyone here know.

 

If anyone has further ideas or has experienced a similar issue, my ears
are open.  Thanks again, Arun! :)

 


From: Arun C Murthy [mailto:[email protected]] 
Sent: Thursday, December 15, 2011 2:03 PM
To: [email protected]
Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not
working)

 

Hi John,

 

 It's hard for folks on this list to diagnose CDH (you might have to ask
their lists). However, I haven't seen similar issues with
hadoop-0.20.2xx in a while.

 

 One thing to check would be to grab a stack trace (jstack) on the tasks
to see what they are up to. Next, try to get a tcpdump to see if the
tasks are indeed sending heartbeats to the TT, which might be the reason
the TTs aren't timing them out.
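
As a rough illustration of the jstack suggestion above, the sketch below
tallies thread states in a saved dump so that a hung task JVM is easier
to triage. It assumes the dump was captured to a text file beforehand
(for example with "jstack <pid> > dump.txt") and relies only on the
"java.lang.Thread.State:" lines that jstack prints under each thread
header. The function names are made up for this example and are not
part of Hadoop or the JDK.

```python
import re
from collections import Counter

# Matches the state line jstack prints under each thread header, e.g.
#    java.lang.Thread.State: WAITING (parking)
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)")


def summarize_thread_states(dump_text):
    """Count threads per java.lang.Thread.State in a jstack dump."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STATE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def blocked_threads(dump_text):
    """Return the names of threads whose reported state is BLOCKED."""
    names = []
    current = None
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # Thread header lines start with the quoted thread name.
            current = line.split('"')[1]
            continue
        match = STATE_RE.match(line)
        if match and match.group(1) == "BLOCKED" and current is not None:
            names.append(current)
    return names
```

Reading the dump with open("dump.txt").read() and passing the text to
both functions gives a quick picture: many BLOCKED threads point at lock
contention, while a task sitting in RUNNABLE inside a socket read points
at I/O that never returns.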

 

hth,

Arun

 

On Dec 15, 2011, at 7:58 AM, John Miller wrote:

 

I've recently come across some interesting behavior in a 50-node
cluster regarding the tasktrackers and task attempts.  Essentially,
tasks are being created but stick at 0.0%, and it seems the
'mapreduce.task.timeout' isn't taking effect: they just sit there (for
days if we let them) and the jobs have to be killed.  It's interesting
to note that the HDFS datanode service and HBase regionserver running on
these nodes work fine, and we've simply been shutting down the
tasktracker service on them in order to get around jobs stalling
forever.

 

Some historical information... We're running Cloudera's cdh3u0 release,
and this has so far only happened on a handful of random tasktracker
nodes.  It seems to only affect those that have been taken down for
maintenance and then brought back into the cluster, or, alternatively,
one node that was brought into the cluster after it had been running for
a while also ran into the same issue.  After the nodes are re-added to
the cluster, the tasktracker service starts getting these stalls.  Note
also that this has not happened to every node that has been taken out of
service for a time and then re-added... I would say about a third of
them have run into this issue after maintenance.  The particular
maintenance issues on the affected nodes were NOT the same, i.e. one was
bad RAM, another was a bad sector on a disk, etc... never the same
initial problem, only the same outcome after rejoining the cluster.

 

It's also never the same mapred job that sticks, nor is there any
evidence tying the stalls to a specific time of day.  Rather, the node
will run fine for many jobs and then, all of a sudden, some tasks will
stall and stick at 0.0%.  There are no visible errors in the log output,
yet nothing moves forward, and the mappers are not released for other
jobs to use until the stalled job is killed.  It seems that the default
'mapreduce.task.timeout' just isn't working for some reason.
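
For reference, the documented behavior of the task timeout is that an
attempt is terminated once it has gone longer than the timeout (600000
ms, i.e. 10 minutes, by default) without reading input, writing output,
or updating its status; a value of 0 disables the check.  On 0.20-based
releases the property is, as far as I know, spelled
'mapred.task.timeout', so the name in the job configuration may be worth
double-checking.  The sketch below is only a simplified model of that
rule, not the tasktracker's actual code; the function name and the
attempt IDs in the usage note are hypothetical.

```python
DEFAULT_TASK_TIMEOUT_MS = 600_000  # 10 minutes, the documented default


def timed_out_attempts(last_progress_ms, now_ms,
                       timeout_ms=DEFAULT_TASK_TIMEOUT_MS):
    """Return attempt IDs whose last progress report is too old.

    last_progress_ms maps an attempt ID to the time (in ms) of its most
    recent progress report.  A timeout of 0 or less disables the check,
    mirroring the documented meaning of a zero timeout.
    """
    if timeout_ms <= 0:
        return []
    return sorted(
        attempt
        for attempt, last in last_progress_ms.items()
        if now_ms - last > timeout_ms
    )
```

For instance, with reports of {"attempt_a": 0, "attempt_b": 500000} and
now_ms=700000, only "attempt_a" exceeds the default timeout.  Under this
model, an attempt that keeps reporting progress, or whose timeout is
effectively disabled, is never selected, which is consistent with the
heartbeat check suggested earlier in the thread.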

 

Has anyone come across anything similar to this?  I can provide more
details/data as needed.

 

