MapReduce >> mail # user >> Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)


John Miller 2011-12-15, 15:58
Arun C Murthy 2011-12-15, 19:03
John Miller 2011-12-15, 20:56
rajesh balamohan 2011-12-20, 03:29
Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)
On Mon, Dec 19, 2011 at 7:29 PM, rajesh balamohan <[EMAIL PROTECTED]> wrote:

> Hi John,
>
> Which version of the JVM are you using (JDK 1.6.0.2xx?), and what JVM
> arguments do you use for spawning the map/reduce slots?
>
> Check whether the JVM is stuck on the machine. I have sometimes seen a task
> JVM launch, go into a spin, and occupy 100% CPU.
>

Yep, this one that Rajesh mentions is a RHEL 6 bug:
https://bugzilla.redhat.com/show_bug.cgi?id=750419
We can reproduce it in our RHEL 6 QA clusters pretty reliably, but we are
still working with Red Hat to reproduce and fix it.

Thanks
-Todd

>
>
> On Fri, Dec 16, 2011 at 2:26 AM, John Miller <[EMAIL PROTECTED]> wrote:
>
>> Hello Arun,
>>
>> Thanks for the quick reply.  I totally understand the CDH issue but
>> figured I’d ask the broader community as well in case there was any
>> known upstream issue, as I’ve noticed some patches relating to “somewhat
>> similar” issues.
>>
>> jstack was already on my radar, but I hadn’t even thought about using
>> tcpdump to catch whether the tasks were heartbeating or not, so thanks for
>> the tip, will make sure to check that out! We are also planning our release
>> update to CDH 3u2 vs. 3u0, which will give us the updated hadoop
>> 0.20.2+923.142 vs. our current 0.20.2+923.21 and may inadvertently fix
>> the issue as well, in which case I’ll at least let everyone here know if it
>> does.
>>
>> If anyone has further ideas or has experienced a similar issue, my
>> ears are open.  Thanks again Arun! :)
>>
>>
>> John Miller | Sr. Linux Systems Administrator
>> 530 E. Liberty St.
>> Ann Arbor, MI 48104
>> Direct: 734.922.7007
>> http://mybuys.com/
>>
>> From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
>> Sent: Thursday, December 15, 2011 2:03 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout
>> not working)
>>
>> Hi John,
>>
>>  It's hard for folks on this list to diagnose CDH (you might have to ask
>> their lists). However, I haven't seen similar issues with hadoop-0.20.2xx
>> in a while.
>>
>>  One thing to check would be to grab a stack trace (jstack) on the tasks
>> to see what they are up to. Next, try to get a tcpdump to see if the tasks
>> are indeed sending heartbeats to the TT, which might be the reason the TTs
>> aren't timing them out.
>>
>> hth,
>>
>> Arun
>>
>>
>> On Dec 15, 2011, at 7:58 AM, John Miller wrote:
>>
>> I’ve recently come across some interesting things happening within a
>> 50-node cluster regarding the tasktrackers and task attempts.  Essentially
>> tasks are being created but they are sticking at 0.0% and it seems the
>> ‘mapreduce.task.timeout’ isn’t taking effect and they just sit there (for
>> days if we let them) and the jobs have to get killed.  It’s interesting to
>> note that the HDFS datanode service and HBase regionserver running on these
>> nodes work fine and we’ve been simply shutting down the tasktracker service
>> on them in order to get around jobs stalling forever.
>>
>> Some historical information… We’re running Cloudera’s cdh3u0 release, and
>> this has so far only happened on a handful of random tasktracker nodes and
>> it seems to have affected only those that have been taken down for
>> maintenance and then brought back into the cluster, or alternatively one
>> node was brought into the cluster after it had been running for a while and
>> we ran into the same issue.  After re-adding the nodes to the cluster the
>> tasktracker service starts getting these stalls.  Note also that this has
>> not happened to every node that has been taken out of service for a time
>> and then re-added… I would say about a third of them have run into this
>> issue after maintenance.  The particular maintenance issues on the affected
Todd Lipcon
Software Engineer, Cloudera
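
The two checks suggested in this thread (look for a task JVM spinning at 100% CPU, then grab a jstack of it) can be combined into a small helper. The following is only a minimal sketch in Python, not anything posted by the participants. It assumes the task JVMs are started with the main class `org.apache.hadoop.mapred.Child` and carry the attempt ID on their command line, as in stock Hadoop 0.20.x; the function names and the 90% threshold are hypothetical. Note that the `pcpu` column of `ps` is averaged over the process lifetime, so a freshly launched spinner shows up quickly while a long-running task may not.

```python
import re

# Main class of the child task JVM in Hadoop 0.20.x (assumption: the cluster
# discussed here launches its tasks the same way).
CHILD_MAIN_CLASS = "org.apache.hadoop.mapred.Child"
ATTEMPT_ID = re.compile(r"attempt_\d+_\d+_[mr]_\d+_\d+")


def find_spinning_tasks(ps_output, cpu_threshold=90.0):
    """Return (pid, cpu, attempt_id) for task JVMs at or above cpu_threshold.

    ps_output is the text printed by `ps -eo pid,pcpu,args`.
    """
    suspects = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip the header row and malformed lines
        pid, cpu, args = int(parts[0]), float(parts[1]), parts[2]
        if CHILD_MAIN_CLASS not in args:
            continue  # not a map/reduce child JVM
        match = ATTEMPT_ID.search(args)
        attempt = match.group(0) if match else None
        if cpu >= cpu_threshold:
            suspects.append((pid, cpu, attempt))
    return suspects


def jstack_commands(suspects):
    """Build the jstack command line to run for each suspect PID."""
    return [["jstack", str(pid)] for pid, _cpu, _attempt in suspects]
```

On a tasktracker node, the input could come from `subprocess.run(["ps", "-eo", "pid,pcpu,args"], capture_output=True, text=True).stdout`, and each command returned by `jstack_commands` would be run as the user that owns the task JVM.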
John Miller 2011-12-20, 18:58
John Miller 2011-12-28, 19:08
John Miller 2011-12-28, 21:36