RE: ConnectionException in container, happens only sometimes
>1. I assume it is the task (container) that tries to establish the connection, but what does it want to connect to?
It is trying to connect to the MRAppMaster to fetch the actual task.

>2. Why does this error happen, and how can I fix it?
It seems the Container is not getting the correct MRAppMaster address for some reason, or the AM is crashing before handing the task to the Container. Most likely this is caused by an invalid host mapping. Can you check that the host mapping is correct on both machines, and also check the AM log from that time for any clue?
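
For what it's worth, a common culprit on Debian/Ubuntu-style installs (an assumption here, not confirmed from this thread) is an /etc/hosts entry that maps the machine's own hostname to a loopback address, so the AM address handed to containers resolves to 127.0.0.1 on every node. A minimal sketch of the bad and corrected mappings, using the hostnames from this thread:

    # /etc/hosts on the AM's node -- problematic: the hostname resolves
    # to loopback, so remote containers are told to dial 127.0.0.1
    127.0.0.1     localhost
    127.0.1.1     slave-1-host

    # /etc/hosts -- corrected: the hostname resolves to the node's
    # cluster-reachable IP (192.168.1.11 is a placeholder)
    127.0.0.1     localhost
    192.168.1.11  slave-1-host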

Thanks
Devaraj k

From: Andrei [mailto:[EMAIL PROTECTED]]
Sent: 10 July 2013 17:32
To: [EMAIL PROTECTED]
Subject: ConnectionException in container, happens only sometimes

Hi,

I'm running a CDH4.3 installation of Hadoop with the following simple setup:

master-host: runs the NameNode, ResourceManager and JobHistoryServer
slave-1-host and slave-2-host: run DataNodes and NodeManagers.

When I run a simple MapReduce job on the client (either via the streaming API or the Pi example from the distribution), I see that some tasks fail.
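
For reference, a typical way to launch the Pi example on CDH4 looks like the following (the examples jar path is an assumption based on a stock CDH4 package layout; adjust for your install):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100

The job output then reports failures like these: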

13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000003_0, Status : FAILED
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000005_0, Status : FAILED
...
13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
...

Every time a different set of tasks/attempts fails. In some cases the number of failed attempts becomes critical and the whole job fails; in other cases the job finishes successfully. I can't see any pattern, but I noticed the following.

Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on _slave-2-host_ there will be a corresponding task syslog with the following contents:

...
2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
        at org.apache.hadoop.ipc.Client.call(Client.java:1229)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
        at com.sun.proxy.$Proxy6.getTask(Unknown Source)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
        at org.apache.hadoop.ipc.Client.call(Client.java:1196)
        ... 3 more
Notice several things:

1. This exception always happens on a different host than the one the ApplicationMaster runs on.
2. It always tries to connect to localhost, not to another host in the cluster.
3. The port number (11812 in this case) is different every time (see the quick check after this list).
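
Taken together, these observations fit a hostname-resolution problem: the MRAppMaster listens on an ephemeral port (hence the different port each run) and its address is handed to each container, so if that address resolves to a loopback entry, a container on another host ends up dialing localhost. A quick sanity check to run on each node (assuming standard Linux tools; the hostnames are the ones from this setup):

    # print this node's fully qualified hostname
    hostname -f
    # resolve the cluster hostnames the way the system resolver will;
    # each should map to a cluster-reachable IP, not 127.0.0.1/127.0.1.1
    getent hosts slave-1-host slave-2-host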

My questions are:

1. I assume it is the task (container) that tries to establish the connection, but what does it want to connect to?
2. Why does this error happen, and how can I fix it?

Any suggestions are welcome.

Thanks,
Andrei