Re: Region server crashes when using replication
I tried that, but I still get the same result with the 0.90.2 build. The
worst part is that a region server fails, then another one tries to take
over and also fails, until the entire cluster is down. The fact that a
replication failure such as this can cause a cascading failure across the
entire cluster is very troubling. What is the design reason for
shutting down the server on a replication error?

I also confirmed that the region server failures are not detected by
the master even after 10 minutes. I'm not sure how to show this, but I
see the servers go down, and when I run status 'detailed' they are
reported as "live" forever. I have zookeeper.session.timeout
set to 20000 (milliseconds), so the failures should be detected within
20 seconds.

-eran
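
For reference, the setting discussed above can be sketched as follows. This is a minimal illustration, not taken from the poster's cluster: it assumes the property lives in hbase-site.xml in the usual Hadoop property format, the XML fragment and the helper name session_timeout_seconds are hypothetical, and the 180000 ms fallback is an assumption based on the 3-minute default mentioned later in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical hbase-site.xml fragment; the value is the one quoted in the message.
HBASE_SITE = """<configuration>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>20000</value>
  </property>
</configuration>"""


def session_timeout_seconds(xml_text, default_ms=180000):
    """Return zookeeper.session.timeout in seconds.

    default_ms is an assumed fallback (3 minutes, as discussed in the thread)
    used when the property is not present in the file.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "zookeeper.session.timeout":
            # The value is expressed in milliseconds.
            return int(prop.findtext("value")) / 1000
    return default_ms / 1000


print(session_timeout_seconds(HBASE_SITE))
```

The value is in milliseconds, so 20000 corresponds to a 20-second session timeout, which is the upper bound the poster expected for failure detection.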

On Tue, Mar 22, 2011 at 21:46, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
>
> You can apply the patch that I included there and that I also
> committed to the 0.90 branch.
>
> J-D
>
> On Tue, Mar 22, 2011 at 12:37 PM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> > Actually, it will probably be a connection timeout, not connection
> > refused, when there is no connection between the two clusters.
> >
> > Is there a workaround I can implement now for HBASE-3664? Can I write
> > something in ZK so the server has an old entry to delete and is happy
> > with it?
> >
> > -eran
> >
> >
> >
> >
> > On Tue, Mar 22, 2011 at 21:01, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> >> Inline.
> >>
> >> J-D
> >>
> >> On Tue, Mar 22, 2011 at 11:51 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> >>> Thanks, J-D.
> >>> As for the first issue, why does this behavior make sense? What happens when
> >>> the connection between the two clusters fails? Will the region servers of the
> >>> primary fail as well, or at least be unable to start? That seems very
> >>> radical.
> >>
> >> The DNS entry should remain, so you won't get UnknownHostException but
> >> ConnectionRefused instead. But that's a different issue: HBASE-3130
> >>
> >>>
> >>> Regarding the second issue, I didn't see anything else in the logs. It just
> >>> seemed like it decided to shut down, but maybe I missed something. I will try to
> >>> reproduce that and let you know if I succeed.
> >>
> >> That'd be nice :)
> >>
> >>>
> >>> Regarding the timeout to detect a failed server, 3 minutes sounds like a
> >>> very long time for a region server to be down. Obviously, during that time
> >>> the data owned by that server is inaccessible. Is there a reason for this
> >>> long timeout? Can it be configured?
> >>>
> >>
> >> We set it that high for people who try to push too much data to
> >> clusters that are too small / badly configured and then end up with
> >> crazy garbage collections. Have fun reading this series of blog posts:
> >> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> >>
> >> Please also see the book about this configuration:
> >> http://hbase.apache.org/book.html#recommended_configurations
> >>
> >