HBase >> mail # user >> RegionServers Crashing every hour in production env


Re: RegionServers Crashing every hour in production env
HBase 0.94 currently doesn't support Hadoop 2.0.

Can you deploy Hadoop 1.1.1 instead?

Are you using 0.94.5?

Thanks

On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:

> Hey guys,
> as I wrote in an email a long time ago, the RSs in my cluster did not get
> along and crashed 3 times a day. I tried many of the options we discussed in
> those emails, but none of them solved the problem. As I was using an old
> version of hadoop, I thought that was the cause.
>
> So I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>
> Unfortunately the RSs did not stop crashing, and worse: now they crash
> every hour, and sometimes, when the RS that holds .ROOT. crashes, the whole
> cluster gets stuck in transition and everything stops working.
> In that case I need to clean the zookeeper znodes and restart the master and
> the RSs.
> To avoid this I am running production with only ONE RS and a monitoring
> script that checks every minute whether the RS is ok and restarts it if not.
> * With a single RS, a crash does not get the cluster stuck.
>
> This is driving me crazy, but I really can't find a solution for the
> cluster.
> I tracked all logs from the start time 16:49 on all the interesting nodes
> (zoo, namenode, master, rs, dn2, dn9, dn10) and copied here what I think is
> useful.
>
> There are some strange errors in DATANODE2, such as an error copying a
> block to itself.
>
> The gc log points to a GC timeout. However, it is very weird that the RS
> spends so much time in GC here, while in the other cases it takes 0.001 sec.
> Besides, the time is spent in sys, which makes me think the problem might be
> somewhere else.
>
> I know that it is a bunch of logs, and that it is very difficult to find
> the problem without much context. But I REALLY need some help. If not a
> solution, then at least what I should read, where I should look, or which
> cases I should monitor.
>
> Thank you very much,
> Pablo Musa
>
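
On the GC observation in the quoted message (long pauses where the time is spent in sys), here is a minimal sketch for pulling such entries out of a gc log. It assumes the log lines carry the HotSpot-style suffix "[Times: user=... sys=..., real=... secs]". The function name, the 1-second threshold and the "sys greater than user" rule are illustrative choices, not anything taken from the thread.

```python
import re

# Assumed HotSpot-style timing suffix, e.g.
#   [Times: user=0.01 sys=0.00, real=0.00 secs]
TIMES_RE = re.compile(
    r"\[Times: user=(\d+\.\d+) sys=(\d+\.\d+), real=(\d+\.\d+) secs\]"
)


def suspicious_pauses(lines, min_real=1.0):
    """Return (line_no, user, sys, real) for long pauses dominated by sys time.

    min_real is a hypothetical threshold in seconds; tune it to your own
    ZooKeeper session timeout.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        match = TIMES_RE.search(line)
        if match is None:
            continue  # not a line with a timing summary
        user, sys_time, real = (float(g) for g in match.groups())
        # Flag pauses that are long in wall-clock terms and where the
        # kernel (sys) accounts for more CPU time than the collector (user).
        if real >= min_real and sys_time > user:
            hits.append((line_no, user, sys_time, real))
    return hits
```

Pauses where sys is much larger than user usually suggest the time is going to the operating system rather than to the collector itself, so it may be worth checking swapping and I/O on that node around the same timestamps.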