NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
MapReduce >> mail # user >> ConnectionException in container, happens only sometimes


Re: ConnectionException in container, happens only sometimes
If it helps, the full AM log can be found here: <http://pastebin.com/zXTabyvv>.
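
One thing that stands out in the container syslog quoted further down: the retry lines pair the name `slave-2-host` with the address `127.0.0.1`, which is a loopback address. Below is a minimal Python sketch for spotting that pattern in a log and for checking how a name resolves on a given node. The function names are hypothetical helpers for illustration, not part of Hadoop, and the regular expression assumes the `host/ip:port` layout seen in the quoted lines. It only reports what the resolver returns; it does not establish the cause of the failures.

```python
import ipaddress
import re
import socket

# Matches the "Retrying connect to server: host/ip:port" part of the
# org.apache.hadoop.ipc.Client lines quoted in this thread (assumed layout).
RETRY_RE = re.compile(
    r"Retrying connect to server: "
    r"(?P<host>[^/\s]+)/(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)"
)


def loopback_targets(log_text):
    """Return sorted (host, ip, port) tuples whose IP is a loopback address."""
    hits = set()
    for match in RETRY_RE.finditer(log_text):
        if ipaddress.ip_address(match.group("ip")).is_loopback:
            hits.add((match.group("host"), match.group("ip"), int(match.group("port"))))
    return sorted(hits)


def resolves_to_loopback(hostname, resolver=socket.gethostbyname):
    """True if `hostname` resolves to a loopback address on this machine.

    `resolver` is injectable so the check can be exercised without a network.
    """
    return ipaddress.ip_address(resolver(hostname)).is_loopback


if __name__ == "__main__":
    sample = (
        "2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: "
        "Retrying connect to server: slave-2-host/127.0.0.1:11812. "
        "Already tried 0 time(s)"
    )
    print(loopback_targets(sample))
```

Running `resolves_to_loopback("slave-2-host")` on each node (with the host names used in this thread) would show whether any machine maps that name to a loopback address.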
On Wed, Jul 10, 2013 at 4:21 PM, Andrei <[EMAIL PROTECTED]> wrote:

> Hi Devaraj,
>
> thanks for your answer. Yes, I suspected it could be because of host
> mapping, so I have already checked (and have just re-checked) the settings in
> /etc/hosts on each machine, and they are all OK. I use both fully qualified
> names (e.g. `master-host.company.com`) and their short forms (e.g.
> `master-host`), so it shouldn't depend on the notation either.
>
> I have also checked the AM syslog. There's nothing about the network, but there
> are several messages like the following:
>
> ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1373460572360_0001_01_000088
>
>
> As I understand it, the container just doesn't get registered in the AM (probably
> because of the same issue). Is that correct? If so, I wonder who sends the "container
> complete event" to the ApplicationMaster?
>
> On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <[EMAIL PROTECTED]> wrote:
>
>>  >1. I assume this is the task (container) that tries to establish
>> connection, but what it wants to connect to?
>>
>> It is trying to connect to MRAppMaster for executing the actual task.
>>
>> >1. I assume this is the task (container) that tries to establish
>> connection, but what it wants to connect to?
>>
>> It seems the Container is not getting the correct MRAppMaster address for
>> some reason, or the AM is crashing before giving the task to the Container.
>> It is probably caused by an invalid host mapping. Can you check that the host
>> mapping is correct on both machines, and also check the AM log from that time
>> for any clue?
>>
>> Thanks
>> Devaraj k
>>
>> *From:* Andrei [mailto:[EMAIL PROTECTED]]
>> *Sent:* 10 July 2013 17:32
>> *To:* [EMAIL PROTECTED]
>> *Subject:* ConnectionException in container, happens only sometimes
>>
>> Hi,
>>
>> I'm running a CDH4.3 installation of Hadoop with the following simple
>> setup:
>>
>> master-host: runs NameNode, ResourceManager and JobHistoryServer
>> slave-1-host and slave-2-host: DataNodes and NodeManagers.
>>
>> When I run a simple MapReduce job (using either the streaming API or the Pi
>> example from the distribution), I see on the client that some tasks fail:
>>
>> 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000003_0, Status : FAILED
>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000005_0, Status : FAILED
>> ...
>> 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
>> ...
>>
>> Every time, a different set of tasks/attempts fails. In some cases the number
>> of failed attempts becomes critical and the whole job fails; in other cases
>> the job finishes successfully. I can't see any pattern, but I noticed the
>> following.
>>
>> Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on
>> _slave-2-host_ there will be a corresponding syslog with the following
>> contents:
>>
>> ...
>>
>> 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>
>> 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>
>> ...
>>
>> 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried