HBase, mail # user - RegionServers Crashing every hour in production env

Re: RegionServers Crashing every hour in production env
Azuryy Yu 2013-03-11, 02:14
Pablo,

One more question: what's your Java version?
On Mon, Mar 11, 2013 at 10:13 AM, Azuryy Yu <[EMAIL PROTECTED]> wrote:

> Hi Pablo,
> A long minor GC like that is terrible. I don't see any swapping in
> your vmstat log.
> I would suggest the following:
> 1) Add the following JVM options:
> -XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19
> -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2
> -XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods
>
> 2) -Xmn is too small; your total memory is 74GB, so just use -Xmn2g.
> 3) What were you doing when the long GC happened, reading or writing?
> If reading, what is the block cache size?
>
>
>
>
> On Mon, Mar 11, 2013 at 6:41 AM, Stack <[EMAIL PROTECTED]> wrote:
>
>> You could increase your ZooKeeper session timeout to 5 minutes while you
>> are figuring out why these long pauses happen.
>> http://hbase.apache.org/book.html#zookeeper.session.timeout
>>
>> Above, there is an outage for almost 5 minutes:
>>
>> >> We slept 225100ms instead of 3000ms, this is likely due to a long
>>
>> You have ganglia or tsdb running?  When you see the big pause above, can
>> you see anything going on on the machine?  (swap, iowait, concurrent fat
>> mapreduce job?)
>>
>> St.Ack
>>
>>
>>
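The "We slept 225100ms instead of 3000ms" line quoted above is what a periodic sleep reports when it wakes up far later than requested. A minimal sketch of that kind of check follows. It is not HBase's actual code: the class name, method names, and the 1000 ms tolerance are hypothetical, chosen only for illustration.

```java
final class PauseDetector {
    // How much longer than requested the sleep actually took, in ms (never negative).
    static long overshootMillis(long requestedMs, long elapsedMs) {
        return Math.max(0L, elapsedMs - requestedMs);
    }

    // True when the overshoot exceeds the tolerance (a hypothetical threshold).
    static boolean isSuspicious(long requestedMs, long elapsedMs, long toleranceMs) {
        return overshootMillis(requestedMs, elapsedMs) > toleranceMs;
    }

    public static void main(String[] args) throws InterruptedException {
        final long periodMs = 3000L;    // same period as in the quoted log line
        final long toleranceMs = 1000L; // assumption: tune for your environment
        final int rounds = args.length > 0 ? Integer.parseInt(args[0]) : 3;

        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            Thread.sleep(periodMs);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            if (isSuspicious(periodMs, elapsedMs, toleranceMs)) {
                // The whole process was stalled (GC, swap, OS scheduling, ...).
                System.err.printf("Slept %dms instead of %dms%n", elapsedMs, periodMs);
            }
        }
    }
}
```

A check like this only shows that the process was stalled; it cannot tell a GC pause from swapping or another OS-level stall, which is why the machine-level metrics asked about above still matter.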
>> On Sun, Mar 10, 2013 at 3:29 PM, Pablo Musa <[EMAIL PROTECTED]> wrote:
>>
>> > Hi Sreepathi,
>> > They say in the book (or on the site) that we could try it to see whether
>> > it is really a timeout error or there is something more. But it is not
>> > recommended for production environments.
>> >
>> > I could give it a try if five minutes will tell us whether the problem
>> > is the GC or elsewhere! Anyway, I think it is hard to believe a GC is
>> > taking 2:30 minutes.
>> >
>> > Abs,
>> > Pablo
>> >
>> >
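One way to check whether GC can account for a pause of that size is to read the cumulative collection time that the JVM exposes through the standard java.lang.management API. The sketch below is only an illustration: the class name GcTimeReport is hypothetical, and it reports on the JVM it runs in. Reading the same beans from a running RegionServer would need remote JMX access, which is assumed here and not shown.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimeReport {
    // Sum of the accumulated collection time (ms) over all collectors in this JVM.
    static long totalGcMillis() {
        long total = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("total GC time: %d ms%n", totalGcMillis());
    }
}
```

If the total grows by far less than the length of an observed pause, the stall probably comes from somewhere other than the collector.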
>> > On 03/10/2013 04:06 PM, Sreepathi wrote:
>> >
>> >> Hi Stack/Ted/Pablo,
>> >>
>> >> Should we increase the hbase.rpc.timeout property to 5 minutes
>> >> (300000 ms)?
>> >>
>> >> Regards,
>> >> - Sreepathi
>> >>
>> >> On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:
>> >>
>> >>>> That combo should be fine.
>> >>>>
>> >>> Great!!
>> >>>
>> >>>
>> >>>> If the JVM is doing a full GC, the application is stopped.
>> >>>> The below does not look like a full GC, but that is a long pause in
>> >>>> system time, enough to kill your zk session.
>> >>>>
>> >>> Exactly. This pause is really making zk expire the RS, which then shuts
>> >>> down (logs at the end of the email).
>> >>> But the question is: what is causing this pause??!!
>> >>>
>> >>>> You swapping?
>> >>>>
>> >>> I don't think so (stats below).
>> >>>
>> >>>> Hardware is good?
>> >>>>
>> >>> Yes, it is a 16-processor machine with 74GB of RAM and plenty of disk
>> >>> space.
>> >>> Below are some metrics I have heard about. Hope they help.
>> >>>
>> >>>
>> >>> ** I am having some problems with the datanodes[1], which are having
>> >>> trouble writing. I really think the issues are related, but I cannot
>> >>> solve any of them :(
>> >>>
>> >>> Thanks again,
>> >>> Pablo
>> >>>
>> >>> [1] http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+jGOPUt15PwFjcCSzigE0APeM9FXaCr[EMAIL PROTECTED]%3E
>> >>>
>> >>> top - 15:38:04 up 297 days, 21:03,  2 users,  load average: 4.34, 2.55, 1.28
>> >>> Tasks: 528 total,   1 running, 527 sleeping,   0 stopped,   0 zombie
>> >>> Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> >>> Mem:  74187256k total, 29493992k used, 44693264k free,  5836576k buffers
>> >>> Swap: 51609592k total,   128312k used, 51481280k free,  1353400k cached
>> >>>
>> >>> ]$ vmstat -w
>> >