Re: RegionServers Crashing every hour in production env
On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:

>> 0.94 currently doesn't support hadoop 2.0
>> Can you deploy hadoop 1.1.1 instead?
>>
>
> I am using cdh4.2.0 which uses this version as default installation.
> I think it will be a problem for me to deploy 1.1.1 because I would need to
> "upgrade" the whole cluster with 70TB of data (backup everything, go
> offline, etc.).
>
> Is there a problem with using cdh4.2.0?
> Should I send my email to the cdh list?
>
>
That combo should be fine.
>> You Full GC'ing around this time?
>>
>
> The GC log shows it took a long time. However, it does not make sense
> for GC to be the cause, since the same amount of data was cleaned before
> and AFTER in just 0.01 secs!
>
>
If the JVM is full GC'ing, the application is stopped.
>
> [Times: user=0.08 sys=137.62, real=137.62 secs]
>
> Besides, the whole time was spent in system time. That is what is bugging me.
>
>
The below does not look like a full GC but that is a long pause in system
time, enough to kill your zk session.
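
The pattern Stack points at, a collection whose wall-clock time is almost entirely system time, can be picked out of a GC log mechanically. Below is a minimal Python sketch, assuming the log uses the `[Times: user=... sys=..., real=... secs]` format quoted in this thread. The function name `find_long_pauses`, the 1-second default threshold, and the "sys dominated" rule are illustrative choices, not anything defined by HBase or the JVM.

```python
import re

# Matches one "[GC ..." entry together with its trailing [Times: ...] block.
ENTRY = re.compile(
    r"(?P<ts>\d+\.\d+): \[GC .*?"
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), "
    r"real=(?P<real>[\d.]+) secs\]"
)


def find_long_pauses(log_text, threshold_s=1.0):
    """Return GC entries whose wall-clock (real) time is >= threshold_s.

    threshold_s is a hypothetical cut-off; pick one relative to your own
    ZooKeeper session timeout.
    """
    # Undo mail quoting ("> ") and line wrapping so every entry is contiguous.
    lines = (line.lstrip("> ").strip() for line in log_text.splitlines())
    flat = " ".join(part for part in lines if part)

    pauses = []
    for m in ENTRY.finditer(flat):
        user, sys_s, real = (float(m.group(k)) for k in ("user", "sys", "real"))
        if real >= threshold_s:
            pauses.append({
                "timestamp": float(m.group("ts")),
                "user": user,
                "sys": sys_s,
                "real": real,
                # Illustrative rule: kernel time exceeds user time and
                # accounts for at least half of the wall-clock pause.
                "sys_dominated": sys_s > user and sys_s >= 0.5 * real,
            })
    return pauses
```

Run against the log excerpt quoted in this thread, only the 1087.834 entry is flagged (real=137.62 secs, nearly all of it sys). A pause like that points away from the collector itself and toward the OS, which is why the next questions are about swapping and hardware.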

You swapping?

Hardware is good?

St.Ack
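
One way to answer the swapping question on a Linux host is to read the kernel's own counters. The sketch below is a hedged illustration, not an HBase tool: it assumes the standard `/proc/meminfo` fields `SwapTotal` and `SwapFree` and the `/proc/vmstat` counters `pswpin` and `pswpout`; the helper names are made up for this example.

```python
def parse_counters(text):
    """Parse 'Name: value [kB]' or 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) >= 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_used_kb(meminfo_text):
    """Swap currently in use, in kB, from the contents of /proc/meminfo."""
    info = parse_counters(meminfo_text)
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def pages_swapped(vmstat_before, vmstat_after):
    """Pages swapped in/out between two snapshots of /proc/vmstat."""
    before = parse_counters(vmstat_before)
    after = parse_counters(vmstat_after)
    return {k: after.get(k, 0) - before.get(k, 0) for k in ("pswpin", "pswpout")}
```

On the region server host, pass in `open("/proc/meminfo").read()` for the first helper, and two reads of `/proc/vmstat` taken some seconds apart for the second. Non-zero swap use alone only says that something was swapped out at some point; growing `pswpin`/`pswpout` counters while the region server is under load are the stronger hint that the JVM heap is being paged.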

>  ...
>
>
> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs]
> 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00,
> real=0.01 secs]
>
> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs]
> 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01,
> real=0.00 secs]
>
> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620
> secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08
> sys=137.62, real=137.62 secs]
>
> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs]
> 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00,
> real=0.01 secs]
>
> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs]
> 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00,
> real=0.01 secs]
>
>
> I really appreciate you guys helping me to find out what is wrong.
>
> Thanks,
> Pablo
>
>
>
> On 03/08/2013 02:11 PM, Stack wrote:
>
>> What RAM says.
>>
>> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
>> session timed out, have not heard from server in 159348ms for sessionid
>> 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>>
>> You Full GC'ing around this time?
>>
>> Put up your configs in a place where we can take a look?
>>
>> St.Ack
>>
>>
>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan <[EMAIL PROTECTED]>
>> wrote:
>>
>>> I think it is with your GC config.  What is your heap size?  What is the
>>> data that you pump in and how large is the block cache?
>>>
>>> Regards
>>> Ram
>>>
>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>>
>>>> 0.94 currently doesn't support hadoop 2.0
>>>>
>>>> Can you deploy hadoop 1.1.1 instead ?
>>>>
>>>> Are you using 0.94.5 ?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hey guys,
>>>>> as I sent in an email a long time ago, the RSs in my cluster did not get
>>>>> along and crashed 3 times a day. I tried a lot of the options we
>>>>> discussed in the emails, but none of them solved the problem. As I used
>>>>> an old version of hadoop I thought this was the problem.
>>>>>
>>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to hadoop
>>>>> 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>>>>>
>>>>> Unfortunately the RSs did not stop crashing, and worse! Now they crash
>>>>> every hour, and sometimes when the RS that holds the .ROOT. crashes the
>>>>> whole cluster gets stuck in transition and everything stops working.
>>>>> In this case I need to clean the zookeeper znodes and restart the master
>>>>> and the RSs.
>>>>> To avoid this case I am running in production with only ONE RS and a
>>>>> monitoring script that checks every minute whether the RS is ok. If not,
>>>>> it restarts it.
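
The monitoring script Pablo describes (check once a minute, restart if the RS is down) is not shown in the thread. A rough Python sketch of that kind of watchdog could look like the following. Everything here is an assumption for illustration: the `pgrep -f HRegionServer` liveness check, the `hbase-daemon.sh start regionserver` restart command (packaged installs such as CDH typically use a service script instead), and the function names.

```python
import subprocess
import time


def regionserver_running(pattern="HRegionServer"):
    """Assumed liveness check: is a process matching the RS main class alive?"""
    result = subprocess.run(["pgrep", "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart_regionserver(cmd=("hbase-daemon.sh", "start", "regionserver")):
    """Assumed restart command; adjust to however HBase is installed."""
    subprocess.run(list(cmd), check=False)


def watch(is_ok, restart, interval_s=60, max_checks=None, sleep=time.sleep):
    """Call is_ok() every interval_s seconds and restart() when it fails.

    Returns the number of restarts performed. max_checks=None loops forever;
    a finite value is mainly useful for testing.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not is_ok():
            restart()
            restarts += 1
        checks += 1
        sleep(interval_s)
    return restarts
```

In production this would be started as `watch(regionserver_running, restart_regionserver)`. Note that a live process is not the same as a healthy region server, and a watchdog like this only papers over the crashes; it does nothing about the long system-time pauses that expire the ZooKeeper session.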