Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)

Todd Lipcon Mon, 19 Dec 2011 20:09:25 -0800

On Mon, Dec 19, 2011 at 7:29 PM, rajesh balamohan <[email protected]> wrote:


> Hi John,
>
> Which version of the JVM are you using (JDK 1.6.0.2xx?), and what JVM
> arguments do you use for spawning the map/reduce slots?
>
> Check whether the JVM is stuck on the machine. Sometimes I have seen a task
> JVM launch, go into a spin, and occupy 100% CPU.
>

Yep, the one that Rajesh mentions is a RHEL 6 bug:
https://bugzilla.redhat.com/show_bug.cgi?id=750419
We can reproduce it in our RHEL6 QA clusters pretty reliably, but are still
working with Red Hat to reproduce/fix.

Thanks
-Todd
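
For anyone wanting to check a node for the symptom described above (a task
JVM pinned at 100% CPU), here is a minimal sketch in Python. It is not from
the thread: the function name find_spinning_jvms and the 90% threshold are
assumptions made for illustration. It only parses text in the format that
`ps -eo pid,pcpu,comm` prints on Linux and lists java processes at or above
the threshold.

```python
def find_spinning_jvms(ps_output, cpu_threshold=90.0):
    """Return [(pid, cpu)] for java processes at or above cpu_threshold.

    ps_output is the text printed by `ps -eo pid,pcpu,comm`, header included.
    Note that ps reports CPU time divided by process lifetime, so this is
    most telling for a JVM that has been spinning since launch.
    The name and threshold are hypothetical, chosen for this sketch.
    """
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # blank or malformed row
        pid_text, cpu_text, command = parts
        try:
            pid, cpu = int(pid_text), float(cpu_text)
        except ValueError:
            continue  # not a data row
        if command.strip() == "java" and cpu >= cpu_threshold:
            hits.append((pid, cpu))
    return hits
```

Feed it the captured output of `ps -eo pid,pcpu,comm` on a tasktracker node;
any PID it returns is a candidate for the jstack check suggested elsewhere in
this thread. A high reading alone does not confirm the RHEL 6 bug, since a
busy but healthy task can also sit near 100%.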

>
>
> On Fri, Dec 16, 2011 at 2:26 AM, John Miller <[email protected]> wrote:
>
>> Hello Arun,
>>
>> Thanks for the quick reply.  I totally understand the CDH issue but
>> figured I’d ask the broader community as well in case there was any
>> known upstream issue, as I’ve noticed some patches relating to “somewhat
>> similar” issues.
>>
>> The jstack was already on my radar, but I hadn’t even thought about
>> tcpdump to catch whether the tasks were heartbeating or not, so thanks for
>> the tip, will make sure to check that out! We are also planning our release
>> update to CDH 3u2 vs. 3u0, which will give us the updated hadoop
>> 0.20.2+923.142 vs. our current 0.20.2+923.21, which may inadvertently fix
>> the issue as well, in which case I’ll at least let everyone here know if it
>> does.
>>
>> Any further ideas, or if anyone else has experienced a similar issue, my
>> ears are open.  Thanks again Arun! :)
>>
>> John Miller  |  Sr. Linux Systems Administrator
>> 530 E. Liberty St.
>> Ann Arbor, MI 48104
>> Direct: 734.922.7007
>> http://mybuys.com/
>>
>> From: Arun C Murthy [mailto:[email protected]]
>> Sent: Thursday, December 15, 2011 2:03 PM
>> To: [email protected]
>> Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout
>> not working)
>>
>> Hi John,
>>
>> It's hard for folks on this list to diagnose CDH (you might have to ask
>> their lists). However, I haven't seen similar issues with hadoop-0.20.2xx
>> in a while.
>>
>> One thing to check would be to grab a stack trace (jstack) on the tasks
>> to see what they are up to. Next, try to get a tcpdump to see if the tasks
>> are indeed sending heartbeats to the TT, which might be the reason the TTs
>> aren't timing them out.
>>
>> hth,
>> Arun
>>
>> On Dec 15, 2011, at 7:58 AM, John Miller wrote:
>>
>> I’ve recently come across some interesting things happening within a
>> 50-node cluster regarding the tasktrackers and task attempts.  Essentially,
>> tasks are being created but they stick at 0.0%, and it seems the
>> ‘mapreduce.task.timeout’ isn’t taking effect: they just sit there (for
>> days if we let them) and the jobs have to be killed.  It’s interesting to
>> note that the HDFS datanode service and HBase regionserver running on these
>> nodes work fine, and we’ve simply been shutting down the tasktracker service
>> on them in order to get around jobs stalling forever.
>>
>> Some historical information… We’re running Cloudera’s cdh3u0 release, and
>> this has so far only happened on a handful of random tasktracker nodes.  It
>> seems to affect only those that have been taken down for maintenance
>> and then brought back into the cluster; alternatively, one node was
>> brought into the cluster after it had been running for a while and we ran
>> into the same issue.  After re-adding the nodes to the cluster, the
>> tasktracker service starts getting these stalls.  Also note that this has
>> not happened to every node that has been taken out of service for a time
>> and then re-added… I would say about a third of them have run into this
>> issue after maintenance.  The particular maintenance issues on the affected
>> nodes were NOT the same, i.e. one was bad RAM, another was a bad sector on a
>> disk, etc… never the same initial problem, only the same outcome after
>> rejoining the cluster.
>>
>> It’s also never the same mapred job that sticks, nor is there any
>> evidence relating the stalls to a specific time of day.  Rather, the
>> node will run fine for many jobs and then all of a sudden some tasks
>> will stall and stick at 0.0%.  There are no visible errors in the log
>> output, but nothing moves forward, and the node will not release the mappers
>> for any other jobs to use until the stalled job is killed.  It seems that
>> the default ‘mapreduce.task.timeout’ just isn’t working for some reason.
>>
>> Has anyone come across anything similar to this?  I can provide more
>> details/data as needed.
>>
>> John Miller  |  Sr. Linux Systems Administrator
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
