HBase user mailing list: RegionServers Crashing every hour in production env


Pablo Musa 2013-03-08, 15:44
Ted Yu 2013-03-08, 16:01

Re: RegionServers Crashing every hour in production env
I think the issue is with your GC config. What is your heap size? How much
data are you pumping in, and how big is the block cache?

Regards
Ram

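For reference, the region server heap and GC flags are set in hbase-env.sh, and
the block cache fraction in hbase-site.xml. A minimal sketch, assuming an 8 GB
heap and the CMS collector (the numbers are placeholders, not recommendations):

  hbase-env.sh:
    export HBASE_HEAPSIZE=8192    # region server heap, in MB
    export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70"

  hbase-site.xml (fraction of the heap given to the block cache):
    <property>
      <name>hfile.block.cache.size</name>
      <value>0.25</value>
    </property>
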
On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> 0.94 currently doesn't support Hadoop 2.0.
>
> Can you deploy Hadoop 1.1.1 instead?
>
> Are you using 0.94.5?
>
> Thanks
>
> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:
>
> > Hey guys,
> > as I sent in an email a long time ago, the RSs in my cluster were not
> > behaving well and crashed 3 times a day. I tried a lot of the options we
> > discussed in those emails, but they did not solve the problem. Since I was
> > using an old version of Hadoop, I thought that was the problem.
> >
> > So I upgraded from Hadoop 0.20 / HBase 0.90 / ZooKeeper 3.3.5 to
> > Hadoop 2.0.0 / HBase 0.94 / ZooKeeper 3.4.5.
> >
> > Unfortunately the RSs did not stop crashing, and it got worse! Now they
> > crash every hour, and sometimes when the RS that holds .ROOT. crashes, the
> > whole cluster gets stuck in transition and everything stops working.
> > In that case I need to clean the ZooKeeper znodes and restart the master
> > and the RSs.
> > To avoid it I am running production with only ONE RS plus a monitoring
> > script that checks every minute whether the RS is ok and restarts it if
> > not.
> > * This setup does not get the cluster stuck.
> >
> > This is driving me crazy, but I really can't find a solution for the
> > cluster. I tracked all the logs from the start time, 16:49, on all the
> > interesting nodes (zoo, namenode, master, rs, dn2, dn9, dn10) and copied
> > here what I think is useful.
> >
> > There are some strange errors in DATANODE2, such as an error copying a
> > block to itself.
> >
> > The GC log points to a GC timeout. However, it is very weird that the RS
> > spends so much time in GC when in other cases it takes 0.001 sec. Besides,
> > the time is spent in sys, which makes me think the problem might be
> > somewhere else.
> >
> > I know that this is a bunch of logs, and that it is very difficult to find
> > the problem without much context. But I REALLY need some help. If not the
> > solution, then at least what I should read, where I should look, or which
> > cases I should monitor.
> >
> > Thank you very much,
> > Pablo Musa
> >
>
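As a side note, the exact builds in play can be confirmed from the shell before
chasing compatibility problems. A quick sketch:

  hadoop version    # e.g. 1.1.1 or 2.0.0-alpha
  hbase version     # e.g. 0.94.5

  # HBase must run against a Hadoop line it was built for; with the 0.94
  # series that usually means the hadoop-core jar under $HBASE_HOME/lib
  # has to match the cluster's Hadoop distribution.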
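The recovery steps described above can be scripted. Clearing HBase's state in
ZooKeeper is usually "hbase zkcli rmr /hbase" with the cluster fully stopped
(/hbase is the default parent znode), and the per-minute watchdog might look
roughly like this sketch, where the paths are assumptions:

  #!/bin/sh
  # rs-watchdog.sh -- run from cron every minute; if the region server
  # process is gone, start it again.
  if ! jps | grep -q HRegionServer; then
      /usr/lib/hbase/bin/hbase-daemon.sh start regionserver
  fi

  # crontab entry:
  # * * * * * /usr/local/bin/rs-watchdog.sh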
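On the GC observation: verbose GC logging separates real collector work from
time the process spends blocked in the kernel, and high sys time during a pause
usually points at swapping or transparent huge pages rather than at the
collector itself. A sketch of the flags for hbase-env.sh (the log path is an
assumption):

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
    -Xloggc:/var/log/hbase/gc-regionserver.log"

  # if the node is swapping, sys time in GC pauses explodes; keeping swap
  # out of the picture helps:
  sysctl -w vm.swappiness=0
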
Stack 2013-03-08, 17:11
Pablo Musa 2013-03-08, 18:58
Stack 2013-03-08, 22:02
Pablo Musa 2013-03-10, 18:59
Sreepathi 2013-03-10, 19:06
Pablo Musa 2013-03-10, 22:29
Stack 2013-03-10, 22:41
Azuryy Yu 2013-03-11, 02:13
Andrew Purtell 2013-03-11, 02:24
Pablo Musa 2013-03-12, 15:43
Pablo Musa 2013-04-03, 18:21
Ted Yu 2013-04-03, 18:36
Pablo Musa 2013-04-03, 20:24
Ted Yu 2013-04-03, 21:40
Azuryy Yu 2013-03-11, 02:14