Re: RegionServers Crashing every hour in production env
Hi Pablo,
That is a terribly long pause for a minor GC. I don't see any swapping in
your vmstat log, but I would suggest the following:
1) Add the following JVM options:
-XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods

2) Your -Xmn is too small. Your total memory is 74GB, so just use -Xmn2g.
3) What were you doing when the long GC happened, reading or writing? If
reading, what is the block cache size?
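
For 1) and 2), here is a minimal sketch of how the flags could go into conf/hbase-env.sh. I have not seen your current file, so it is an assumption that you pass RegionServer flags through HBASE_REGIONSERVER_OPTS; if you set them elsewhere (for example in HBASE_OPTS), adapt accordingly.

```shell
# conf/hbase-env.sh (sketch, not a drop-in file)
# Assumption: RegionServer JVM flags are passed via HBASE_REGIONSERVER_OPTS.
# Keep whatever is already set and append the young-gen size from 2)
# plus the GC-related flags from 1).
export HBASE_REGIONSERVER_OPTS="${HBASE_REGIONSERVER_OPTS:-} \
  -Xmn2g \
  -XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19 \
  -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2 \
  -XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods"
```

The RegionServers need a restart to pick this up, and it is worth keeping the GC log enabled so you can compare pause times before and after the change.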
On Mon, Mar 11, 2013 at 6:41 AM, Stack <[EMAIL PROTECTED]> wrote:

> You could increase your zookeeper session timeout to 5 minutes while you
> are figuring out why these long pauses happen.
> http://hbase.apache.org/book.html#zookeeper.session.timeout
>
> Above, there is an outage for almost 5 minutes:
>
> >> We slept 225100ms instead of 3000ms, this is likely due to a long
>
> You have ganglia or tsdb running?  When you see the big pause above, can
> you see anything going on on the machine?  (swap, iowait, concurrent fat
> mapreduce job?)
>
> St.Ack
>
>
>
> On Sun, Mar 10, 2013 at 3:29 PM, Pablo Musa <[EMAIL PROTECTED]> wrote:
>
> > Hi Sreepathi,
> > they say in the book (or the site) that we could try it to see if it is
> > really a timeout error or if there is something more. But it is not
> > recommended for production environments.
> >
> > I could give it a try if five minutes will assure us whether the problem
> > is the GC or elsewhere!! Anyway, I think it is hard to believe a GC is
> > taking 2:30 minutes.
> >
> > Abs,
> > Pablo
> >
> >
> > On 03/10/2013 04:06 PM, Sreepathi wrote:
> >
> >> Hi Stack/Ted/Pablo,
> >>
> >> Should we increase the hbase.rpc.timeout property to 5 minutes
> >> (300000 ms)?
> >>
> >> Regards,
> >> - Sreepathi
> >>
> >> On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa <[EMAIL PROTECTED]> wrote:
> >>
> >>>> That combo should be fine.
> >>>>
> >>> Great!!
> >>>
> >>>
> >>>> If JVM is full GC'ing, the application is stopped.
> >>>> The below does not look like a full GC but that is a long pause in
> >>>> system time, enough to kill your zk session.
> >>>>
> >>> Exactly. This pause is really making zk expire the RS, which shuts
> >>> down (logs at the end of the email).
> >>> But the question is: what is causing this pause??!!
> >>>
> >>>> You swapping?
> >>>>
> >>> I don't think so (stats below).
> >>>
> >>>> Hardware is good?
> >>>>
> >>> Yes, it is a 16 processor machine with 74GB of RAM and plenty of disk
> >>> space.
> >>> Below are some metrics I have heard about. Hope it helps.
> >>>
> >>>
> >>> ** I am having some problems with the datanodes[1], which are having
> >>> trouble writing. I really think the issues are related, but I cannot
> >>> solve any of them :(
> >>>
> >>> Thanks again,
> >>> Pablo
> >>>
> >>> [1] http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+jGOPUt15PwFjcCSzigE0APeM9FXaCr[EMAIL PROTECTED]%3E
> >>>
> >>> top - 15:38:04 up 297 days, 21:03,  2 users,  load average: 4.34, 2.55, 1.28
> >>> Tasks: 528 total,   1 running, 527 sleeping,   0 stopped,   0 zombie
> >>> Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> >>> Mem:  74187256k total, 29493992k used, 44693264k free,  5836576k buffers
> >>> Swap: 51609592k total,   128312k used, 51481280k free,  1353400k cached
> >>>
> >>> ]$ vmstat -w
> >>> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> >>>   r  b       swpd       free       buff      cache   si   so    bi bo in cs  us sy  id wa st
> >>>   2  0     128312   32416928    5838288    5043560    0    0   202 53