Pablo Musa 2013-03-08, 15:44
Re: RegionServers Crashing every hour in production env
0.94 currently doesn't support Hadoop 2.0.

Can you deploy Hadoop 1.1.1 instead?

Are you using 0.94.5?

Thanks

On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:

> Hey guys,
> as I wrote in an email a long time ago, the RSs in my cluster did not get
> along and crashed 3 times a day. I tried a lot of the options we discussed
> in the emails, but they did not solve the problem. As I was using an old
> version of hadoop, I thought this was the problem.
>
> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>
> Unfortunately the RSs did not stop crashing, and worse! Now they crash
> every hour, and sometimes, when the RS that holds the .ROOT. crashes, the
> whole cluster gets stuck in transition and everything stops working.
> In this case I need to clean the zookeeper znodes and restart the master
> and the RSs.
> To avoid this case I am running in production with only ONE RS and a
> monitoring script that checks every minute whether the RS is ok. If not,
> it restarts it.
> * This case does not get the cluster stuck.
>
> This is driving me crazy, but I really can't find a solution for the
> cluster.
> I tracked all logs from the start time 16:49 from all interesting nodes
> (zoo, namenode, master, rs, dn2, dn9, dn10) and copied here what I think is
> useful.
>
> There are some strange errors in DATANODE2, such as an error copying a
> block to itself.
>
> The gc log points to a GC timeout. However, it is very weird that the RS
> spends so much time in GC while in the other cases it takes 0.001sec.
> Besides, the time spent is in sys, which makes me think that the problem
> might be in another place.
>
> I know that it is a bunch of logs, and that it is very difficult to find
> the problem without much context. But I REALLY need some help. If not the
> solution, then at least what I should read, where I should look, or which
> cases I should monitor.
>
> Thank you very much,
> Pablo Musa
>
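The observation above that the long GC pauses show their time in sys rather than user can be checked mechanically. Below is a minimal sketch, assuming the RS gc log was written by HotSpot with -XX:+PrintGCDetails, so each collection ends with a "[Times: user=... sys=..., real=... secs]" suffix. The function name, the 1-second threshold, and the sample lines are made up for illustration and are not taken from the logs in this thread. A long pause dominated by sys time often points outside the JVM (for example swapping), but that is a hint to investigate, not a diagnosis.

```python
import re

# Matches the "[Times: user=... sys=..., real=... secs]" suffix that HotSpot
# appends to each collection when -XX:+PrintGCDetails is enabled.
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>\d+\.\d+) sys=(?P<sys>\d+\.\d+), "
    r"real=(?P<real>\d+\.\d+) secs\]"
)


def suspicious_pauses(lines, min_real_secs=1.0):
    """Return (line_number, user, sys, real) for long pauses where sys > user.

    min_real_secs is an arbitrary illustrative threshold (hypothetical).
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = TIMES_RE.search(line)
        if match is None:
            continue  # not a GC timing line
        user = float(match.group("user"))
        sys_time = float(match.group("sys"))
        real = float(match.group("real"))
        if real >= min_real_secs and sys_time > user:
            hits.append((number, user, sys_time, real))
    return hits


if __name__ == "__main__":
    # Made-up sample lines in the HotSpot format, not from the real cluster.
    sample = [
        "[GC [ParNew: 1K->1K(2K), 0.0010 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]",
        "[GC [ParNew: 1K->1K(2K), 26.0000 secs] [Times: user=0.10 sys=25.30, real=26.00 secs]",
    ]
    for hit in suspicious_pauses(sample):
        print(hit)
```

To use it on a real log, pass an open file object instead of the sample list, e.g. `suspicious_pauses(open("gc.log"))` where the path is whatever the RS was started with.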
ramkrishna vasudevan 2013-03-08, 16:32
Stack 2013-03-08, 17:11
Pablo Musa 2013-03-08, 18:58
Stack 2013-03-08, 22:02
Pablo Musa 2013-03-10, 18:59
Sreepathi 2013-03-10, 19:06
Pablo Musa 2013-03-10, 22:29
Stack 2013-03-10, 22:41
Azuryy Yu 2013-03-11, 02:13
Andrew Purtell 2013-03-11, 02:24
Pablo Musa 2013-03-12, 15:43
Pablo Musa 2013-04-03, 18:21
Ted Yu 2013-04-03, 18:36
Pablo Musa 2013-04-03, 20:24
Ted Yu 2013-04-03, 21:40
Azuryy Yu 2013-03-11, 02:14