HBase user mailing list >> Cluster crash


Re: Cluster crash
After 3 days of running with the configuration changes recommended by J-D,
the cluster now seems stable.
For the benefit of others, two issues were identified:
First, the HBase heap (HBASE_HEAPSIZE) was set too high. It turns out that
each Hadoop daemon takes at least 1 GB at startup, even if it is doing
nothing. Since we have a DataNode, a TaskTracker and a Thrift server running
on each machine, those take up 3 GB of RAM that must be accounted for when
allocating memory for the region server.
Second, we had "-XX:+CMSIncrementalMode" configured, which is apparently not
recommended on multi-core machines.
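
For anyone wanting a concrete starting point, here is a minimal hbase-env.sh
sketch of that accounting. The numbers are illustrative only (they assume a
node with roughly 16 GB of RAM running a DataNode, a TaskTracker and a Thrift
server next to the region server), not the actual values from our cluster:

  # hbase-env.sh (illustrative values, adjust to your hardware)
  # Total RAM minus ~3 GB for DataNode + TaskTracker + Thrift server,
  # minus some headroom for the OS, leaves the region server heap (in MB).
  export HBASE_HEAPSIZE=8000

  # Plain CMS instead of -XX:+CMSIncrementalMode, which is aimed at
  # single- and dual-core machines.
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"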

Thanks J-D for all the help.

-eran

On Mon, Apr 11, 2011 at 23:53, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:

> Alright, so I was able to get the logs from Eran. The HDFS errors are a
> red herring; the really important part that followed in the region
> server log is:
>
> 2011-04-10 10:14:27,278 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 144490ms for
> sessionid 0x12ee42283320050, closing socket connection and attempting
> reconnect
>
> That is a GC pause of roughly 2m24s. The HDFS errors come from the fact that the
> master split the logs _while_ the region server was sleeping.
>
> J-D
>
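
A side note on pauses like this one: turning on GC logging in hbase-env.sh
makes multi-minute pauses easy to confirm after the fact, and the budget a
pause has to stay under is the ZooKeeper session timeout, which HBase takes
from zookeeper.session.timeout in hbase-site.xml. A minimal sketch of the
logging flags (standard HotSpot options, nothing specific to this cluster):

  # hbase-env.sh - log GC activity so long pauses show up clearly
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"
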
> On Mon, Apr 11, 2011 at 11:47 AM, Jean-Daniel Cryans
> <[EMAIL PROTECTED]> wrote:
> > So my understanding is that this log file was opened at 7:29, and then
> > at 10:12:55 something triggered a recovery of that block, with the new
> > name being blk_1213779416283711358_54249
> >
> > It seems that process was started by the DFS Client at 10:12:55,
> > but the RS log starts at 10:14. Would it be possible to see what came
> > before that? It would also be nice to have a view of those blocks on
> > all the datanodes.
> >
> > It would be nice to do this debugging on IRC as it can require a lot
> > of back and forth.
> >
> > J-D
> >
> > On Mon, Apr 11, 2011 at 11:22 AM, Eran Kutner <eran@.com> wrote:
> >> There wasn't an attachment; I pasted inline all the lines from all the
> >> NN logs that contain that particular block number.
> >>
> >> As for CPU/IO: first, there is nothing else running on those servers;
> >> second, CPU utilization on the slaves at peak load was around 40% and
> >> disk IO utilization was under 20%. That's the strange thing about it (I
> >> have another thread going about the performance): there is no bottleneck
> >> I could identify, and yet the performance was relatively low compared to
> >> the numbers I see quoted for HBase elsewhere.
> >>
> >> The first line of the NN log says:
> >> BLOCK* NameSystem.allocateBlock: /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279.blk_1213779416283711358_54194
> >> So it looks like the file name is:
> >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279
> >>
> >> Is there a better way to associate a file with a block?
> >>
> >> -eran
> >>
> >>
> >>
> >
>
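
To close the loop on the question at the end of the quoted thread (mapping a
block back to a file): fsck can do this without grepping the NN logs. A rough
sketch, using the path from the log above; -openforwrite may be needed for
WALs that are still open, and the exact output format varies between Hadoop
versions:

  # List files under /hbase/.logs with their blocks and locations, then keep
  # some context lines so the owning file name shows up next to the block.
  hadoop fsck /hbase/.logs -openforwrite -files -blocks -locations \
    | grep -B 10 blk_1213779416283711358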