Re: Jobtracker page hangs ..again.
Update: after adjusting the network routing, DNS query times are back in the microsecond range,
as they should be, and the issue is completely solved.
The Jobtracker page no longer hangs when launching a 100k-mapper job.

Cheers,

On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum <
[EMAIL PROTECTED]> wrote:

> Ok, after some sweat, I think I found the root cause, but I still need
> another team to help me fix it.
> It lies in DNS. For each tip:task line, I saw (through tcpdump) a DNS
> query go out to the DNS server, and the timestamps seem to match.
>
> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>
> 127 ms
>
> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>
> 126 ms
>
> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>
>
> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>
> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>
> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>
> 20:39:23.493505 IP dns1.domain > jtk.53110: 41717* 1/4/2 A xx.xx.xx.xx (189)
>
>
> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>
> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>
> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
> (189)
>
> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>
>
> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>
> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>
> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
> (189)
>
> I looked at the jobtracker log in another datacenter when the same job was
> submitted; there, the tip:task lines appear much, much faster.
>
> The question this raises is whether job initialization really queries DNS,
> and if so, whether there is any way to suppress that. The topology file is
> already in place, with both the names and the IPs already in it.
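>
> For reference, here is a minimal sketch (plain JDK only, host names are
> placeholders) of how the per-lookup delay can be confirmed directly from the
> jobtracker host; it also sets the JDK "networkaddress.cache.ttl" security
> property, which caches successful lookups inside the JVM:
>
>     import java.net.InetAddress;
>     import java.security.Security;
>
>     public class DnsLatencyCheck {
>         public static void main(String[] args) throws Exception {
>             // Cache successful lookups inside this JVM for an hour.
>             Security.setProperty("networkaddress.cache.ttl", "3600");
>
>             String[] hosts = {"host1", "host2", "host3"};  // placeholder names
>             for (String host : hosts) {
>                 long start = System.nanoTime();
>                 InetAddress addr = InetAddress.getByName(host);  // forward lookup
>                 long elapsedMs = (System.nanoTime() - start) / 1000000;
>                 System.out.println(host + " -> " + addr.getHostAddress()
>                         + " in " + elapsedMs + " ms");
>             }
>         }
>     }
>
> If each lookup lands in the ~125 ms range, that matches the gaps between the
> tip:task lines above.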
>
>
> Hope this makes sense.
>
> Patai
>
>
>
>
> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
> [EMAIL PROTECTED]> wrote:
>
>> Appreciate your input, Bryan. I will try to reproduce the issue and look at
>> the namenode log before, during, and after the pause.
>> Wish me luck.
>>
>>
>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
>> [EMAIL PROTECTED]> wrote:
>>
>>> When I've had problems with a slow jobtracker, I've found the issue to
>>> be one of the following two possibilities (so far):
>>>
>>> - long GC pause (I'm guessing this is not it based on your email)
>>> - HDFS is slow
>>>
>>> I haven't dived into the code yet, but circumstantially I've found that
>>> when you submit a job, the jobtracker needs to put a bunch of files in HDFS,
>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>> larger and larger jobs, aside from the size of the splits serialization in
>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>> the 50030 web UI.
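>>>
>>> As a rough check (just a sketch, not jobtracker code; the path and payload
>>> size are arbitrary), timing a small write against the same HDFS, roughly the
>>> size of a job.xml, can show whether HDFS itself is the slow part:
>>>
>>>     import org.apache.hadoop.conf.Configuration;
>>>     import org.apache.hadoop.fs.FSDataOutputStream;
>>>     import org.apache.hadoop.fs.FileSystem;
>>>     import org.apache.hadoop.fs.Path;
>>>
>>>     public class HdfsWriteProbe {
>>>         public static void main(String[] args) throws Exception {
>>>             Configuration conf = new Configuration();
>>>             FileSystem fs = FileSystem.get(conf);           // default fs = cluster HDFS
>>>             Path probe = new Path("/tmp/hdfs-write-probe"); // placeholder path
>>>
>>>             byte[] payload = new byte[64 * 1024];           // ~64 KB of dummy data
>>>             long start = System.nanoTime();
>>>             FSDataOutputStream out = fs.create(probe, true); // overwrite if present
>>>             out.write(payload);
>>>             out.close();
>>>             long elapsedMs = (System.nanoTime() - start) / 1000000;
>>>             System.out.println("wrote " + payload.length + " bytes in "
>>>                     + elapsedMs + " ms");
>>>
>>>             fs.delete(probe, false);
>>>             fs.close();
>>>         }
>>>     }
>>>
>>> If that write is slow while the namenode looks otherwise idle, the pauses
>>> are more likely on the HDFS side than in the jobtracker itself.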
>>>
>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>> before and after that pause?
>>>
>>>
>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> A while back, I was fighting with the jobtracker page hanging: when I
>>>> browsed to http://jobtracker:50030, the browser wouldn't show the job info
>>>> as usual, which turned out to be caused by allowing too much job history to
>>>> be kept in the jobtracker.
>>>>
>>>> Currently, I am setting up a new cluster with a 40g heap on the namenode and
>>>> jobtracker, each on dedicated machines. The not-so-fun part starts here: a
>>>> developer tried to test out the cluster by launching a 76k-map job (the cluster has