MapReduce >> mail # user >> Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)


Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)
On Mon, Dec 19, 2011 at 7:29 PM, rajesh balamohan <[EMAIL PROTECTED]> wrote:

> Hi John,
>
> Which JVM version are you using (JDK 1.6.0.2xx?), and what JVM arguments
> do you use for spawning the map/reduce slots?
>
> Check whether the JVM is stuck on the machine. Sometimes I have seen a task
> JVM launch, go into a spin, and occupy 100% CPU.
>

Yep, the one Rajesh mentions is a RHEL 6 bug:
https://bugzilla.redhat.com/show_bug.cgi?id=750419
We can reproduce it in our RHEL 6 QA clusters pretty reliably, but we are
still working with Red Hat to reproduce/fix it.

Thanks
-Todd
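
For anyone wanting to check their own nodes for the spinning task JVM Rajesh describes, and to grab the jstack output Arun suggests further down the thread, here is a rough Python sketch. It is not an official tool. It assumes the task JVMs show up in `ps` with the main class `org.apache.hadoop.mapred.Child` (true of the 0.20-era releases as far as I know, but verify on your install), that `jstack` is on the PATH, and that you run it as the user owning the task JVMs. The function names and the 90% threshold are made up for illustration. Note that `ps` reports CPU averaged over the process lifetime, so this is best at catching JVMs that spin right from launch.

```python
import subprocess

# Assumption: task JVMs in Hadoop 0.20.x / CDH3 are started with this main class.
CHILD_MAIN_CLASS = "org.apache.hadoop.mapred.Child"


def find_spinning_task_jvms(ps_output, cpu_threshold=90.0):
    """Return (pid, cpu) pairs for task JVMs at or above cpu_threshold.

    ps_output is the text printed by `ps -eo pid,pcpu,args`.
    """
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, %cpu, then the full command line
        if len(parts) < 3:
            continue
        pid_text, cpu_text, args = parts
        if CHILD_MAIN_CLASS not in args:
            continue
        try:
            pid, cpu = int(pid_text), float(cpu_text)
        except ValueError:
            continue  # not a normal process row
        if cpu >= cpu_threshold:
            hits.append((pid, cpu))
    return hits


def dump_stacks(pid):
    """Run `jstack <pid>` and return whatever it prints."""
    result = subprocess.run(["jstack", str(pid)],
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    ps = subprocess.run(["ps", "-eo", "pid,pcpu,args"],
                        capture_output=True, text=True).stdout
    for pid, cpu in find_spinning_task_jvms(ps):
        print("task JVM %d at %.1f%% CPU" % (pid, cpu))
        print(dump_stacks(pid))
```

Run it on a tasktracker node while an attempt is sitting at 0.0%. If the stack dump shows the JVM never got past startup, that would be consistent with the RHEL 6 bug above rather than a problem in the task code.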

>
>
> On Fri, Dec 16, 2011 at 2:26 AM, John Miller <[EMAIL PROTECTED]> wrote:
>
>> Hello Arun,
>>
>> Thanks for the quick reply.  I totally understand the CDH issue but
>> figured I’d ask the broader community as well in case there was a known
>> upstream issue, as I’ve noticed some patches relating to “somewhat
>> similar” issues.
>>
>> jstack was already on my radar, but I hadn’t even thought about using
>> tcpdump to catch whether the tasks were heartbeating or not, so thanks for
>> the tip; I will make sure to check that out! We are also planning to
>> upgrade from CDH 3u0 to CDH 3u2, which will take us from our current hadoop
>> 0.20.2+923.21 to 0.20.2+923.142 and may inadvertently fix the issue as
>> well, in which case I’ll at least let everyone here know if it does.
>>
>> If anyone has further ideas or has experienced a similar issue, my
>> ears are open.  Thanks again Arun! :)
>>
>>
>> John Miller  |  Sr. Linux Systems Administrator
>> 530 E. Liberty St.
>> Ann Arbor, MI 48104
>> Direct: 734.922.7007
>> http://mybuys.com/
>>
>>
>> From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
>> Sent: Thursday, December 15, 2011 2:03 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout
>> not working)
>>
>> Hi John,
>>
>> It's hard for folks on this list to diagnose CDH (you might have to ask
>> their lists). However, I haven't seen similar issues with hadoop-0.20.2xx
>> in a while.
>>
>> One thing to check would be to grab a stack trace (jstack) on the tasks
>> to see what they are up to. Next, try to get a tcpdump to see if the tasks
>> are indeed sending heartbeats to the TT, which might be the reason the TTs
>> aren't timing them out.
>>
>> hth,
>> Arun
>>
>> On Dec 15, 2011, at 7:58 AM, John Miller wrote:
>>
>> I’ve recently come across some interesting things happening within a
>> 50-node cluster regarding the tasktrackers and task attempts.  Essentially,
>> tasks are being created but they stick at 0.0%, and it seems the
>> ‘mapreduce.task.timeout’ isn’t taking effect: they just sit there (for
>> days if we let them) and the jobs have to be killed.  It’s interesting to
>> note that the HDFS datanode service and HBase regionserver running on these
>> nodes work fine, and we’ve simply been shutting down the tasktracker service
>> on them in order to get around jobs stalling forever.
>>
>> Some historical information: we’re running Cloudera’s cdh3u0 release, and
>> this has so far only happened on a handful of random tasktracker nodes. It
>> seems to have affected only those that were taken down for maintenance
>> and then brought back into the cluster; alternatively, one node was
>> brought into the cluster after it had been running for a while and we ran
>> into the same issue.  After the nodes are re-added to the cluster, the
>> tasktracker service starts getting these stalls.  Note also that this has
>> not happened to every node that has been taken out of service for a time
>> and then re-added; I would say about a third of them have run into this
>> issue after maintenance.  The particular maintenance issues on the affected
Todd Lipcon
Software Engineer, Cloudera