Re: ConnectionException in container, happens only sometimes
If it helps, the full AM log can be found here: <http://pastebin.com/zXTabyvv>.
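
One detail in the task syslog quoted further down is that the retry target is printed as `slave-2-host/127.0.0.1:11812`, which may mean the hostname is resolving to a loopback address on that node. Below is a minimal Python sketch for checking this, not a definitive diagnosis. It assumes the `host/ip:port` form seen in those log lines, and the function names `loopback_targets` and `resolve_ipv4` are made up for illustration.

```python
import ipaddress
import re
import socket

# Assumed log shape, taken from the quoted syslog lines:
#   "Retrying connect to server: <host>/<ipv4>:<port>"
TARGET_RE = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def loopback_targets(log_text):
    """Return (host, ip, port) for each retry target whose IP is loopback."""
    hits = []
    for match in TARGET_RE.finditer(log_text):
        if ipaddress.ip_address(match.group("ip")).is_loopback:
            hits.append(
                (match.group("host"), match.group("ip"), int(match.group("port")))
            )
    return hits


def resolve_ipv4(hostname):
    """Return the IPv4 addresses the local resolver reports for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Print how this node resolves its own hostname and flag loopback results.
    name = socket.gethostname()
    for addr in resolve_ipv4(name):
        flag = "LOOPBACK" if ipaddress.ip_address(addr).is_loopback else "ok"
        print(f"{name} -> {addr} [{flag}]")
```

Running the script on each node shows what that node's own resolver returns for its hostname, and `loopback_targets` can be fed the text of a container syslog to list any connection attempts aimed at a loopback address.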
On Wed, Jul 10, 2013 at 4:21 PM, Andrei <[EMAIL PROTECTED]> wrote:

> Hi Devaraj,
>
> thanks for your answer. Yes, I suspected it could be because of host
> mapping, so I have already checked (and have just re-checked) the settings in
> /etc/hosts on each machine, and they are all OK. I use both fully-qualified
> names (e.g. `master-host.company.com`) and their short forms (e.g.
> `master-host`), so it shouldn't depend on the notation either.
>
> I have also checked the AM syslog. There's nothing about the network, but there
> are several messages like the following:
>
> ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1373460572360_0001_01_000088
>
> As I understand it, the container just doesn't get registered in the AM (probably
> because of the same issue). Is that correct? If so, I wonder who sends the
> "container complete event" to the ApplicationMaster?
>
> On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <[EMAIL PROTECTED]> wrote:
>
>>  >1. I assume this is the task (container) that tries to establish
>> connection, but what it wants to connect to?
>>
>> It is trying to connect to the MRAppMaster to execute the actual task.
>>
>> >1. I assume this is the task (container) that tries to establish
>> connection, but what it wants to connect to?
>>
>> It seems the container is not getting the correct MRAppMaster address for
>> some reason, or the AM is crashing before giving the task to the container.
>> It is probably caused by an invalid host mapping. Can you check that the host
>> mapping is correct on both machines, and also check the AM log from that
>> time for any clue?
>>
>> Thanks
>>
>> Devaraj k
>>
>> From: Andrei [mailto:[EMAIL PROTECTED]]
>> Sent: 10 July 2013 17:32
>> To: [EMAIL PROTECTED]
>> Subject: ConnectionException in container, happens only sometimes
>>
>> Hi,
>>
>> I'm running a CDH4.3 installation of Hadoop with the following simple
>> setup:
>>
>> master-host: runs NameNode, ResourceManager and JobHistoryServer
>>
>> slave-1-host and slave-2-host: DataNodes and NodeManagers.
>>
>> When I run a simple MapReduce job (either using the streaming API or the Pi
>> example from the distribution), I see on the client that some tasks fail:
>>
>> 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
>> attempt_1373454026937_0005_m_000003_0, Status : FAILED
>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
>> attempt_1373454026937_0005_m_000005_0, Status : FAILED
>> ...
>> 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
>> ...
>>
>> Every time, a different set of tasks/attempts fails. In some cases the
>> number of failed attempts becomes critical and the whole job fails; in other
>> cases the job finishes successfully. I can't see any pattern, but I noticed
>> the following.
>>
>> Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on
>> _slave-2-host_ there will be a corresponding syslog with the following
>> contents:
>>
>> ...
>>
>> 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client:
>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>> 0 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>
>> 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client:
>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>> 1 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>
>> ...
>>
>> 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client:
>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried