HDFS >> mail # user >> Re: Namenode / Cluster scaling issues in AWS environment


Re: Namenode / Cluster scaling issues in AWS environment
You might also consider federation.
Chris
On 11/3/2013 3:21 AM, Manish Malhotra wrote:
> Hi All,
>
> I'm facing issues scaling a Hadoop cluster. I have the following
> cluster config:
>
>
> 1. AWS infrastructure
> 2. 400 DNs
> 3. NN:
>             120 GB memory, 10 Gb network, 32 cores
>             dfs.namenode.handler.count = 128
>             ipc queue size = 128 (default)
> 4. DN: 15.5 GB memory, 1 Gb network, 8 cores
> 5. Hadoop version: 1.0.2
>
>
> Problem: Sometimes the NN becomes unstable and starts showing DNs as
> down, although the DNs are actually running.
> I have seen "Socket timeout exception" from the DNs and also
> "xrecievers Exception".
> It looks like the NN is busy during that time and suddenly starts
> losing the heartbeats of the DNs.
> Once it sees DNs as down, it starts replicating their blocks to other
> nodes, but then more nodes become unavailable and it tries to
> replicate those blocks as well.
> The NN gets trapped in this cycle and is not able to come out of it.
> The NN looks fine from a memory and CPU usage point of view.
> At most it uses 150% CPU. I believe version 1.0.2 does not use
> multiple cores and uses a single core only.
>
> Potential Reasons:
>
> 1. Small files: we have lots and lots of small files, and we are
> working on it.
> 2. AWS infrastructure is not reliable, so we should increase the
> "datanode.recheck.interval" property to give more time before
> declaring a DN dead.
> 3. Lots of connections to the NN from clients and MR jobs.
> 4. DNs have issues in terms of memory / threads, so they are actually
> not even connecting to the NN.
> But we have not seen an OOM issue yet.
>
> 5. An NN thread dump taken at the time of the issue shows all the
> Handler threads waiting for a lock.
>
> If anybody has had a similar experience with Hadoop on AWS or any
> other infrastructure and can give some input, that would be great.
>
> Regards,
> Manish
>
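
A rough illustration of potential reason 2 in the quoted message (giving the NN more time before it declares a DN dead). The sketch assumes the Hadoop 1.x behaviour in which the NameNode marks a DataNode dead after 2 * recheck interval + 10 * heartbeat interval without a heartbeat, and assumes the 1.x property names heartbeat.recheck.interval (milliseconds, default 300000) and dfs.heartbeat.interval (seconds, default 3). These names and defaults differ from the name quoted in the message and should be checked against the release in use. The 15-minute value is purely hypothetical, not a recommendation.

```python
# Sketch only: estimates how long a NameNode waits before declaring a
# DataNode dead, under the ASSUMED Hadoop 1.x formula
#   2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
# Property names and defaults are assumptions; verify them for your release.

def dead_node_timeout_ms(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Return the assumed dead-node timeout in milliseconds."""
    return 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000


if __name__ == "__main__":
    default_ms = dead_node_timeout_ms()
    # Hypothetical tuning: raise the recheck interval to 15 minutes.
    raised_ms = dead_node_timeout_ms(recheck_interval_ms=900_000)
    print(f"default timeout: {default_ms / 60_000:.1f} min")
    print(f"raised timeout:  {raised_ms / 60_000:.1f} min")
```

A longer timeout only delays the re-replication storm described in the message. It does not address why the NN handler threads are blocked in the first place.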