Re: RegionServers Crashing every hour in production env
Stack 2013-03-08, 17:11
What Ram says.

2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 159348ms for sessionid
0x13d3c4bcba600a7, closing socket connection and attempting reconnect
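
159 seconds without a heartbeat is far longer than any session a stock
ZooKeeper will keep alive, so whatever stalled that JVM, the RS was bound to
lose its session. If the stalls can't be fully cured, you can at least buy
headroom by raising the timeout on both sides. A minimal sketch, assuming the
usual hbase-site.xml / zoo.cfg placement (note ZooKeeper caps whatever the
client asks for at maxSessionTimeout, which defaults to 20 * tickTime, i.e.
40s with tickTime=2000):

  <!-- hbase-site.xml: session length HBase asks ZooKeeper for, in ms -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>

  # zoo.cfg: raise the server-side cap so the grant can match
  maxSessionTimeout=180000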

You Full GC'ing around this time?
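
If GC logging isn't on yet, it is the quickest way to answer that. A minimal
sketch for hbase-env.sh, using the standard HotSpot flags of this vintage
(the log path is illustrative):

  # hbase-env.sh: log every collection with wall-clock timestamps
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"

A pause in that log lining up with the 17:24 session loss would confirm GC as
the trigger.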

Put up your configs in a place where we can take a look?

St.Ack
On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan <[EMAIL PROTECTED]> wrote:

> I think the problem is with your GC config.  What is your heap size?  How
> much data do you pump in, and how big is the block cache?
>
> Regards
> Ram
>
> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > 0.94 currently doesn't support hadoop 2.0
> >
> > Can you deploy hadoop 1.1.1 instead?
> >
> > Are you using 0.94.5?
> >
> > Thanks
> >
> > On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:
> >
> > > Hey guys,
> > > as I sent in an email a long time ago, the RSs in my cluster did not
> > > get along and crashed 3 times a day. I tried a lot of the options we
> > > discussed in those emails, but it did not solve the problem. As I was
> > > using an old version of hadoop, I thought that was the problem.
> > >
> > > So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
> > > hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
> > >
> > > Unfortunately the RSs did not stop crashing, and worse! Now they crash
> > > every hour, and sometimes when the RS that holds .ROOT. crashes, the
> > > whole cluster gets stuck in transition and everything stops working.
> > > In that case I need to clean the zookeeper znodes and restart the
> > > master and the RSs.
> > > To avoid this I am running in production with only ONE RS and a
> > > monitoring script that checks every minute whether the RS is ok and
> > > restarts it if not (a sketch of such a watchdog follows the quoted
> > > thread below).
> > > * This case does not get the cluster stuck.
> > >
> > > This is driving me crazy, but I really can't find a solution for the
> > > cluster. I tracked all logs from the start time 16:49 on all the
> > > interesting nodes (zoo, namenode, master, rs, dn2, dn9, dn10) and
> > > copied here what I think is useful.
> > >
> > > There are some strange errors in DATANODE2, such as an error copying a
> > > block to itself.
> > >
> > > The GC log points to a GC timeout. However, it is very weird that the
> > > RS spends so much time in GC when in the other cases it takes
> > > 0.001 sec. Besides, the time is spent in sys, which makes me think the
> > > problem might be somewhere else.
> > >
> > > I know this is a bunch of logs, and that it is very difficult to find
> > > the problem without much context. But I REALLY need some help. If not
> > > the solution, then at least what I should read, where I should look,
> > > or which cases I should monitor.
> > >
> > > Thank you very much,
> > > Pablo Musa
> > >
> >
>
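
High sys time inside GC pauses often means the JVM is touching swap; checking
free -m and vm.swappiness on the RS box is a cheap test. As for the watchdog
referenced above, a minimal sketch, assuming a standard layout where jps can
see the HRegionServer process and $HBASE_HOME/bin/hbase-daemon.sh manages it
(paths and log names are illustrative):

  #!/bin/sh
  # Cron this every minute: restart the lone RegionServer if its JVM is gone.
  # Assumes $HBASE_HOME is set and this user may run hbase-daemon.sh.
  if ! jps | grep -q HRegionServer; then
      echo "$(date): HRegionServer down, restarting" >> /var/log/hbase/watchdog.log
      "$HBASE_HOME/bin/hbase-daemon.sh" start regionserver
  fi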