You might also consider federation.
On 11/3/2013 3:21 AM, Manish Malhotra wrote:
> Hi All,
> I'm having trouble scaling a Hadoop cluster. The cluster configuration
> is as follows:
> 1. AWS Infrastructure.
> 2. 400 DNs
> 3. NN:
> 120 GB memory, 10 Gb network, 32 cores
> dfs.namenode.handler.count = 128
> ipc queue size = 128 (default)
> 4. DN: 15.5 GB memory, 1 Gb network, 8 cores
> 5. Hadoop version: 1.0.2
> Problem: Sometimes the NN becomes unstable and starts showing DNs as down,
> even though the DNs are actually running.
> I have seen "Socket timeout exception" from DN and also " xrecievers
> It looks like the NN is busy during that time and suddenly starts losing
> the heartbeats of the DNs.
> Once it sees DNs as down, it starts replicating their blocks to other nodes,
> but then more nodes become unavailable and it tries to replicate those
> blocks as well.
> This becomes a cycle that the NN is trapped in and cannot get out of.
> The NN looks fine from a memory and CPU usage point of view.
> It uses at most 150% CPU. I believe version 1.0.2 does not use multiple
> cores and runs on a single core only.
> Potential Reasons:
> 1. Small files: we have lots and lots of small files, and we are working
> on it.
> 2. AWS infrastructure is not reliable, so we should increase the
> "datanode.recheck.interval" property to allow more time before
> declaring a DN dead.
> 3. Lots of connections to NN from clients and MR jobs.
> 4. DNs have memory or thread issues, so they are actually not even
> connecting to the NN.
> But we have not seen an OOM issue yet.
> 5. An NN thread dump taken at the time of the issue shows all the handler
> threads waiting for a lock.
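On point 5, it may help to put a number on how many handler threads are blocked in each dump. Below is a rough Python sketch that tallies thread states for the IPC handlers in jstack-style output. The "IPC Server handler" name prefix is what I would expect the NN handler threads to be called, and the sample dump is synthetic (it is not taken from your cluster), so treat both as assumptions and adjust as needed.

```python
import re
from collections import Counter

# A jstack thread header starts with the quoted thread name.
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# The state line follows the header, indented.
STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>\w+)')


def handler_states(dump_text, name_prefix="IPC Server handler"):
    """Count thread states for threads whose name starts with name_prefix.

    name_prefix is an assumption about how the handler threads are named.
    """
    counts = Counter()
    current = None
    for line in dump_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = STATE.match(line)
        if state and current is not None and current.startswith(name_prefix):
            counts[state.group("state")] += 1
            current = None  # count each thread once
    return counts


# Synthetic sample, for illustration only.
SAMPLE = '''\
"IPC Server handler 0 on 8020" daemon prio=10 tid=0x01 nid=0x11 waiting for monitor entry [0x01]
   java.lang.Thread.State: BLOCKED (on object monitor)
"IPC Server handler 1 on 8020" daemon prio=10 tid=0x02 nid=0x12 waiting for monitor entry [0x02]
   java.lang.Thread.State: BLOCKED (on object monitor)
"IPC Server handler 2 on 8020" daemon prio=10 tid=0x03 nid=0x13 runnable [0x03]
   java.lang.Thread.State: RUNNABLE
"main" prio=10 tid=0x04 nid=0x14 waiting on condition [0x04]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
'''

if __name__ == "__main__":
    for state, count in handler_states(SAMPLE).most_common():
        print(state, count)
```

Running this over a few dumps taken some seconds apart should show whether the handlers stay blocked or only queue up briefly behind one long operation.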
> If anybody has similar experience with Hadoop on AWS or any other
> infrastructure and can give some input, that would be great.
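On point 2, a quick way to see what raising the recheck interval buys you. As far as I recall, in the 1.x line the NN declares a DN dead after 2 * recheck interval + 10 * heartbeat interval without a heartbeat, and the property is named heartbeat.recheck.interval (milliseconds) rather than datanode.recheck.interval. Both the names and the defaults in this sketch are assumptions to verify against the hdfs-default.xml of your release.

```python
# Sketch of the dead-DataNode timeout calculation for a Hadoop 1.x NameNode.
# Assumed properties and defaults (verify against your release):
#   heartbeat.recheck.interval = 300000 (milliseconds)
#   dfs.heartbeat.interval     = 3      (seconds)


def dead_node_timeout_ms(recheck_interval_ms=5 * 60 * 1000, heartbeat_interval_s=3):
    """Time without heartbeats after which the NN treats a DN as dead."""
    return 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000


if __name__ == "__main__":
    default_ms = dead_node_timeout_ms()
    raised_ms = dead_node_timeout_ms(recheck_interval_ms=15 * 60 * 1000)
    print("default:", default_ms / 60000.0, "minutes")
    print("15 minute recheck interval:", raised_ms / 60000.0, "minutes")
```

A longer timeout only delays the re-replication storm. It does not address why the heartbeats are being missed in the first place, so I would treat it as a stopgap.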