HBase >> mail # user >> RegionServers Crashing every hour in production env


Re: RegionServers Crashing every hour in production env
On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:

> 0.94 currently doesn't support hadoop 2.0
>> Can you deploy hadoop 1.1.1 instead ?
>>
>
> I am using cdh4.2.0 which uses this version as default installation.
> I think it will be a problem for me to deploy 1.1.1 because I would need to
> "upgrade" the whole cluster with 70TB of data (backup everything, go
> offline, etc.).
>
> Is there a problem with using cdh4.2.0?
> Should I send my email to the cdh list instead?
>
>
That combo should be fine.
>  You Full GC'ing around this time?
>>
>
> The GC shows it took a long time. However, it does not make sense that
> GC is the cause, since the same amount of data was cleaned before and
> AFTER in just 0.01 secs!
>
>
If the JVM is full GC'ing, the application is stopped.
>
> [Times: user=0.08 sys=137.62, real=137.62 secs]
>
> Besides the whole time was used by system. That is what is bugging me.
>
>
The log below does not look like a full GC, but that is a long pause in
system time, long enough to kill your zk session.
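Pauses like the one quoted below can be spotted mechanically by scanning the GC log for the `[Times: user=... sys=..., real=... secs]` stanzas. A minimal sketch (the threshold is an arbitrary assumption; the format matches the ParNew lines quoted further down):

```python
import re

# Matches the "[Times: user=X sys=Y, real=Z secs]" stanza as it
# appears in the GC log excerpts quoted in this thread.
TIMES = re.compile(
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), real=(?P<real>[\d.]+) secs\]"
)

def long_pauses(lines, threshold_secs=1.0):
    """Yield (user, sys, real) for every GC entry whose real pause exceeds
    threshold_secs. High sys with near-zero user points at the OS
    (swapping, slow disk), not at the collector itself."""
    for line in lines:
        m = TIMES.search(line)
        if m:
            user, sy, real = (float(m.group(g)) for g in ("user", "sys", "real"))
            if real > threshold_secs:
                yield user, sy, real

log = [
    "1087.319: ... [Times: user=0.04 sys=0.01, real=0.00 secs]",
    "1087.834: ... [Times: user=0.08 sys=137.62, real=137.62 secs]",
]
print(list(long_pauses(log)))  # only the 137 s pause is flagged
```

Run against the full GC log, this isolates exactly the sys-heavy pauses Stack is pointing at.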

You swapping?
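Whether the box is swapping can be answered directly from `/proc` on Linux; a minimal sketch (the `/proc/meminfo` fields are standard Linux, nothing from this thread):

```python
def swap_used_kb(meminfo_path="/proc/meminfo"):
    """Return kB of swap currently in use (SwapTotal - SwapFree).
    A persistently non-zero value on a RegionServer host is suspect."""
    fields = {}
    with open(meminfo_path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            fields[key] = int(rest.split()[0])  # values are in kB
    return fields["SwapTotal"] - fields["SwapFree"]
```

A one-off check is not enough, though: watching `vmstat` while a long pause is in progress tells you whether swap-in/out is happening at that moment.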

Hardware is good?

St.Ack

>  ...
>
>
> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs]
> 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00,
> real=0.01 secs]
>
> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs]
> 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01,
> real=0.00 secs]
>
> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620
> secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08
> sys=137.62, real=137.62 secs]
>
> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs]
> 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00,
> real=0.01 secs]
>
> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs]
> 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00,
> real=0.01 secs]
>
>
> I really appreciate you guys helping me to find out what is wrong.
>
> Thanks,
> Pablo
>
>
>
> On 03/08/2013 02:11 PM, Stack wrote:
>
>> What RAM says.
>>
>> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
>> session timed out, have not heard from server in 159348ms for sessionid
>> 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>>
>> You Full GC'ing around this time?
>>
>> Put up your configs in a place where we can take a look?
>>
>> St.Ack
>>
>>
>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan <
>> ramkrishna.s.vasudevan@gmail.com <[EMAIL PROTECTED]>>
>> wrote:
>>
>>  I think it is with your GC config.  What is your heap size?  What is the
>>> data that you pump in and how much is the block cache size?
>>>
>>> Regards
>>> Ram
>>>
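For reference on Ram's block cache question: HBase sizes the on-heap block cache as a fixed fraction of the RS heap via `hfile.block.cache.size` (0.25 was the usual default around the 0.94 era). The numbers below are an illustrative assumption, not Pablo's actual settings:

```python
def block_cache_bytes(heap_bytes, hfile_block_cache_size=0.25):
    """Upper bound of the on-heap block cache, computed as a fixed
    fraction of the -Xmx heap (hfile.block.cache.size)."""
    return int(heap_bytes * hfile_block_cache_size)

heap = 8 * 1024**3  # e.g. -Xmx8g (illustrative only)
print(block_cache_bytes(heap) / 1024**2)  # 2048.0 MB
```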
>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>>
>>>  0.94 currently doesn't support hadoop 2.0
>>>>
>>>> Can you deploy hadoop 1.1.1 instead ?
>>>>
>>>> Are you using 0.94.5 ?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hey guys,
>>>>> as I sent in an email a long time ago, the RSs in my cluster did not
>>>>> get along and crashed 3 times a day. I tried a lot of the options we
>>>>> discussed in the emails, but they did not solve the problem. As I was
>>>>> using an old version of hadoop, I thought that was the problem.
>>>>>
>>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
>>>>> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>>>>>
>>>>> Unfortunately the RSs did not stop crashing, and worse! Now they crash
>>>>> every hour, and sometimes when the RS that holds the .ROOT. crashes,
>>>>> the whole cluster gets stuck in transition and everything stops
>>>>> working. In this case I need to clean the zookeeper znodes and restart
>>>>> the master and the RSs.
>>>>> To avoid this case I am running in production with only ONE RS and a
>>>>> monitoring script that checks every minute whether the RS is ok, and
>>>>> restarts it if not.
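The watchdog Pablo describes could look roughly like this. The liveness check and restart commands are hypothetical placeholders, since the actual script is not shown in the thread:

```python
import subprocess
import time

CHECK_CMD = ["pgrep", "-f", "HRegionServer"]                # hypothetical liveness check
RESTART_CMD = ["service", "hbase-regionserver", "restart"]  # hypothetical restart

def rs_alive(run=subprocess.run):
    """True if a RegionServer process is found on this host."""
    return run(CHECK_CMD, stdout=subprocess.DEVNULL).returncode == 0

def watchdog(interval_secs=60, run=subprocess.run):
    """Every minute: if the RS is down, restart it (as described above)."""
    while True:
        if not rs_alive(run):
            run(RESTART_CMD)
        time.sleep(interval_secs)
```

Such a script keeps the cluster limping along, but it only masks the underlying pause; the sys-time spikes in the GC log still need to be explained.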