Re: Jobtracker page hangs ..again.
Thanks Harsh,
Appreciate your input, as always.

On Aug 12, 2013, at 20:01, Harsh J <[EMAIL PROTECTED]> wrote:

> If you're not already doing it, run a local name caching daemon (such
> as nscd) on each cluster node. Hadoop does a lot of lookups, and a
> local cache would go a long way in reducing the load on your DNS.
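>
> Enabling it is roughly the following (a sketch assuming a RHEL-style
> node; package and service names may differ on your distro):
>
>     yum install -y nscd    # install the name service cache daemon
>     service nscd start
>     chkconfig nscd on      # start it on boot
>     nscd -g                # print stats; check the "hosts cache" section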
>
> On Tue, Aug 13, 2013 at 3:09 AM, Patai Sangbutsarakum
> <[EMAIL PROTECTED]> wrote:
>> Update: after adjusting the network routing, DNS queries resolve in
>> microseconds, as they are supposed to. The issue is completely solved.
>> The jobtracker page doesn't hang anymore when launching a 100k-mapper job.
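>>
>> For anyone checking the same thing, resolver latency can be eyeballed
>> with dig (the hostname below is just a placeholder):
>>
>>     dig host1.example.com | grep "Query time"
>>     # a healthy answer reports "Query time: 0 msec";
>>     # before the fix we were seeing gaps of ~126 ms per lookup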
>>
>> Cheers,
>>
>>
>>
>> On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Ok, after some sweat, I think I found the root cause, but I still need
>>> another team to help me fix it.
>>> It lies in the DNS. For each tip:task line below, I saw a matching DNS
>>> query to the DNS server in the tcpdump; the timestamps line up.
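>>>
>>> The capture was along these lines (interface name is a placeholder):
>>>
>>>     tcpdump -nn -i eth0 udp port 53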
>>>
>>> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>>>
>>> 127 ms
>>>
>>> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>>>
>>> 126 ms
>>>
>>> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>>>
>>>
>>> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>>>
>>> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>>>
>>> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>>>
>>> 20:39:23.493505 IP dns1.domain > jtk.53110: 41717* 1/4/2 A xx.xx.xx.xx (189)
>>>
>>>
>>> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>>>
>>> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>>>
>>> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
>>> (189)
>>>
>>> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>>>
>>>
>>> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>>>
>>> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>>>
>>> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
>>> (189)
>>>
>>> I looked at the jobtracker log in the other datacenter when the same
>>> job was submitted there. The timestamp gaps between the tip:task lines
>>> are much, much smaller.
>>>
>>> The question this raises: does job initialization really query DNS,
>>> and if so, is there any way to suppress that? The topology file is
>>> already in place, and it already contains both the names and the IPs.
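>>>
>>> One workaround I may try (untested; the names and addresses below are
>>> placeholders) is pre-populating /etc/hosts on the jobtracker and making
>>> lookups consult files before DNS:
>>>
>>>     # /etc/hosts
>>>     10.0.1.1  host1.example.com  host1
>>>     10.0.1.2  host2.example.com  host2
>>>
>>>     # /etc/nsswitch.conf -- check files before hitting DNS
>>>     hosts: files dns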
>>>
>>>
>>> Hope this makes sense.
>>>
>>> Patai
>>>
>>>
>>>
>>>
>>> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum
>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Appreciate your input, Bryan. I will try to reproduce it and look at
>>>> the namenode log before, while, and after it pauses.
>>>> Wish me luck.
>>>>
>>>>
>>>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> When I've had problems with a slow jobtracker, I've found the issue to
>>>>> be one of the following two (so far) possibilities:
>>>>>
>>>>> - long GC pause (I'm guessing this is not it based on your email)
>>>>> - hdfs is slow
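>>>>>
>>>>> For the GC case, watching the heap live usually makes long pauses
>>>>> obvious; a sketch, assuming a single JobTracker process on the host:
>>>>>
>>>>>     jstat -gcutil $(pgrep -f JobTracker) 1000   # GC stats every second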
>>>>>
>>>>> I haven't dived into the code yet, but circumstantially I've found that
>>>>> when you submit a job the jobtracker needs to put a bunch of files in HDFS,
>>>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>>>> larger and larger jobs, aside from the size of the splits serialization in
>>>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>>>> your jobtracker.  This affects other jobs' ability to submit, as well as
>>>>> the 50030 web UI.
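>>>>>
>>>>> A crude spot-check of HDFS write latency from the jobtracker host
>>>>> (the paths below are just examples):
>>>>>
>>>>>     time hadoop fs -put /etc/hosts /tmp/latency-probe
>>>>>     hadoop fs -rm /tmp/latency-probe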
>>>>>
>>>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>>>> do you see a corresponding pause in the namenode logs?  What gets spewed