Ok, after some sweat, i think I found the root cause but still need another
team to help me fix it.
It lines on the DNS. Somehow each of the tip:task line, through the
tcpdump, i saw the dns query to dns server. Timestamp seems matches to me.
2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189)
20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy (189)
20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz (189)
I looked at the jobtracker log in other datacenter when submitted with the
same query. Timestamp in each tip:task line is much much faster.
The question that raise here is the job initialization is really request
the DNS, if so is there any way to suppress that. topology file is already
in place where name and ip are already there.
Hope this make sense
On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
[EMAIL PROTECTED]> wrote:
> Appreciate your input Bryant, i will try to reproduce and see the namenode
> log before, while, and after it pause.
> Wish me luck
> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
> [EMAIL PROTECTED]> wrote:
>> When I've had problems with a slow jobtracker, i've found the issue to be
>> one of the following two (so far) possibilities:
>> - long GC pause (I'm guessing this is not it based on your email)
>> - hdfs is slow
>> I haven't dived into the code yet, but circumstantially I've found that
>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>> such as the job.xml, the job jar, etc. I'm not sure how this scales with
>> larger and larger jobs, aside form the size of the splits serialization in
>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>> your jobtracker. This affects other jobs being able to submit, as well as
>> the 50030 web ui.
>> I'd take a look at your namenode logs. When the jobtracker logs pause,
>> do you see a corresponding pause in the namenode logs? What gets spewed
>> before and after that pause?
>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>> [EMAIL PROTECTED]> wrote:
>>> A while back, i was fighting with the jobtracker page hangs when i
>>> browse to http://jobtracker:50030 browser doesn't show jobs info as
>>> usual which ends up because of allowing too much job history kept in
>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>> around 6k-ish mappers)
>>> Job initialization was success, and finished the job.
>>> However, before the job is actually running, i can't access to the
>>> jobtracker page anymore same symptom as above.
>>> i see bunch of this in jobtracker log
>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node