HBase, mail # user - RegionServers Crashing every hour in production env


Pablo Musa              2013-03-08, 15:44
Ted Yu                  2013-03-08, 16:01
ramkrishna vasudevan    2013-03-08, 16:32
Stack                   2013-03-08, 17:11
Pablo Musa              2013-03-08, 18:58
Stack                   2013-03-08, 22:02
Pablo Musa              2013-03-10, 18:59
Sreepathi               2013-03-10, 19:06
Pablo Musa              2013-03-10, 22:29
Stack                   2013-03-10, 22:41
Azuryy Yu               2013-03-11, 02:13
Re: RegionServers Crashing every hour in production env
Andrew Purtell 2013-03-11, 02:24
Be careful with GC tuning: throwing changes at an application without
analysis of what is going on with the heap is shooting in the dark. One
particularly good treatment of the subject is here:
http://java.dzone.com/articles/how-tame-java-gc-pauses
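
A minimal way to get that analysis is to turn on GC logging for the region
servers before changing anything else. A sketch for hbase-env.sh (the log
path is just an example, adjust for your install):

    # hbase-env.sh: enable GC logging for region servers (HotSpot 6/7 era flags)
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -verbose:gc \
      -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/log/hbase/regionserver-gc.log"

The PrintGCApplicationStoppedTime lines help tie a ZooKeeper session expiry
back to a specific stop-the-world pause.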

If you have made custom changes to blockcache or memstore configurations,
back them out until you're sure everything else is ok.
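
If you are not sure what was overridden, these are the properties to look
for in hbase-site.xml; deleting the overrides falls back to the shipped
defaults (roughly 25% of heap for the block cache and 40% for memstores in
the 0.9x releases of this era):

    <!-- hbase-site.xml: remove these overrides to return to defaults -->
    <property>
      <name>hfile.block.cache.size</name>
      <value>0.25</value>  <!-- fraction of heap used for the block cache -->
    </property>
    <property>
      <name>hbase.regionserver.global.memstore.upperLimit</name>
      <value>0.4</value>   <!-- fraction of heap all memstores may occupy -->
    </property>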

Watch carefully for swapping. Set the vm.swappiness sysctl to 0. Monitor
for spikes in page scanning or any swap activity. Nothing brings on
"Juliette" pauses better than a JVM partially swapped out. The Java GC
starts collection by examining the oldest pages, and those are the first
pages the OS swaps out...
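
Concretely, on each region server (standard Linux tooling; exact column
names vary a little by distro and sysstat version):

    # set swappiness now, and persist it across reboots
    sysctl -w vm.swappiness=0
    echo 'vm.swappiness = 0' >> /etc/sysctl.conf

    # watch swap-in/swap-out; the si/so columns should stay at 0
    vmstat 5

    # page scanning / paging statistics, if sysstat is installed
    sar -B 5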

On Mon, Mar 11, 2013 at 10:13 AM, Azuryy Yu <[EMAIL PROTECTED]> wrote:

> Hi Pablo,
> Such a long minor GC is terrible. I don't see any swapping in your vmstat
> log, but I would suggest:
> 1) add the following JVM options:
> -XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19
> -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2
> -XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods
>
> 2) -Xmn is too small; your total Mem is 74GB, so make it -Xmn2g
> 3) what were you doing when the long GC happened, reading or writing? If
> reading, what is the block cache size?
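
For reference only: flags like those in 1) would go on the same
HBASE_REGIONSERVER_OPTS line in hbase-env.sh mentioned earlier. A sketch,
not an endorsement of applying them without first reading the GC logs:

    # hbase-env.sh: candidate tuning flags from the suggestion above
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19 \
      -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2 \
      -XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods"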
>
>
>
>
> On Mon, Mar 11, 2013 at 6:41 AM, Stack <[EMAIL PROTECTED]> wrote:
>
> > You could increase your zookeeper session timeout to 5 minutes while you
> > are figuring out why these long pauses happen.
> > http://hbase.apache.org/book.html#zookeeper.session.timeout
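
For reference, the setting Stack links to is an hbase-site.xml property; a
sketch of the debugging-only change (note that ZooKeeper's own
maxSessionTimeout in zoo.cfg must be at least this large for it to take
effect):

    <!-- hbase-site.xml: widen the ZK session timeout while debugging -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>300000</value>  <!-- 5 minutes, in milliseconds -->
    </property>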
> >
> > Above, there is an outage for almost 5 minutes:
> >
> > >> We slept 225100ms instead of 3000ms, this is likely due to a long
> >
> > You have ganglia or tsdb running?  When you see the big pause above, can
> > you see anything going on on the machine?  (swap, iowait, concurrent fat
> > mapreduce job?)
> >
> > St.Ack
> >
> >
> >
> > On Sun, Mar 10, 2013 at 3:29 PM, Pablo Musa <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Sreepathi,
> > > As they say in the book (or on the site), we could try it to see if it
> > > is really a timeout error or something more, but it is not recommended
> > > for production environments.
> > >
> > > I could give it a try if five minutes would tell us whether the problem
> > > is the GC or elsewhere!! Anyway, I find it hard to believe a GC is
> > > taking 2:30 minutes.
> > >
> > > Abs,
> > > Pablo
> > >
> > >
> > > On 03/10/2013 04:06 PM, Sreepathi wrote:
> > >
> > >> Hi Stack/Ted/Pablo,
> > >>
> > >> Should we increase the hbase.rpc.timeout property to 5 minutes
> > >> (300000 ms)?
> > >>
> > >> Regards,
> > >> - Sreepathi
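
For reference, hbase.rpc.timeout is likewise an hbase-site.xml property;
the change Sreepathi floats would look something like this sketch:

    <!-- hbase-site.xml: HBase client/server RPC timeout, in milliseconds -->
    <property>
      <name>hbase.rpc.timeout</name>
      <value>300000</value>
    </property>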
> > >>
> > >> On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:
> > >>
> > >>>> That combo should be fine.
> > >>>
> > >>> Great!!
> > >>>
> > >>>
> > >>>> If JVM is full GC'ing, the application is stopped.
> > >>>> The below does not look like a full GC, but that is a long pause in
> > >>>> system time, enough to kill your zk session.
> > >>>
> > >>> Exactly. This pause is really making zk expire the RS, which shuts
> > >>> down (logs at the end of the email).
> > >>> But the question is: what is causing this pause??!!
> > >>>
> > >>>> You swapping?
> > >>>
> > >>> I don't think so (stats below).
> > >>>
> > >>>> Hardware is good?
> > >>>
> > >>> Yes, it is a 16-processor machine with 74GB of RAM and plenty of disk
> > >>> space. Below are some metrics I have heard about. Hope it helps.
> > >>>
> > >>>
> > >>> ** I am having some problems with the datanodes[1], which are having
> > >>> trouble writing. I really think the issues are related, but I cannot
> > >>> solve either of them :(
> > >>>
> > >>> Thanks again,
> > >>> Pablo
> > >>>
> > >>> [1] http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+jGOPUt15PwFjcCSzigE0APeM9FXaCr[EMAIL PROTECTED]%3E

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
Pablo Musa              2013-03-12, 15:43
Pablo Musa              2013-04-03, 18:21
Ted Yu                  2013-04-03, 18:36
Pablo Musa              2013-04-03, 20:24
Ted Yu                  2013-04-03, 21:40
Azuryy Yu               2013-03-11, 02:14