HBase, mail # user - Cascading failure leads to loss of all region servers


Re: Cascading failure leads to loss of all region servers
Bryan Beaudreault 2012-04-12, 01:17
Hi Stack,

Thanks for the reply.  Unfortunately, our first instinct was to restart the
region servers, and when they came back up the compaction appears to have
succeeded (perhaps because the heap was low enough after a fresh restart).
I listed the files under that region and there is now only 1 file.
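For reference, the store files under a region can be listed straight from HDFS. This is only a sketch assuming a CDH3/HBase 0.90-era directory layout; the table name, encoded region name, and column family below are placeholders, not values from this thread:

```shell
# List the store files under one column family of the region.
# Assumed layout: /hbase/<table>/<encoded-region-name>/<family>/<hfile>
# All path components here are placeholders.
hadoop fs -ls /hbase/mytable/0123456789abcdef0123456789abcdef/cf

# Total on-disk size of the region, to compare against the ~425mb figure.
hadoop fs -du /hbase/mytable/0123456789abcdef0123456789abcdef
```

These commands need a running HDFS cluster, so there is nothing to run locally; the point is just that one oversized store file would show up immediately in the `-ls` sizes.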

We will be running this job again in the near future.  We are going to try
to rate-limit the writes a bit (though only 10 reducers were running at
once to begin with), and I will keep your suggestions in mind if it happens
despite that.

- Bryan

On Wed, Apr 11, 2012 at 4:35 PM, Stack <[EMAIL PROTECTED]> wrote:

> On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
> <[EMAIL PROTECTED]> wrote:
> > We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
> > hosting about 17k regions.
>
> That's too many, but that's another story.
>
> > That pattern repeats on all of the region servers, every 5-8 minutes
> until
> > all are down. Should there be some safeguards on a compaction causing a
> > region server to go OOM?  The region appears to only be around 425mb in
> > size.
> >
>
> My guess is that Region A has a massive or corrupt record in it.
>
> You could disable the region for now while you figure out what's wrong
> w/ it.
>
> If you list files under this region, what do you see?  Are there many?
>
> Can you see what files are selected for compaction?  This will narrow
> the set to look at.  You could poke at them w/ the hfile tool.  See
> '8.7.5.2.2. HFile Tool' in the reference guide.
>
> St.Ack
>
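Taken together, the diagnostics Stack suggests might look something like the following. The paths are placeholders, and the HFile tool invocation is the 0.90-era form described in '8.7.5.2.2. HFile Tool' in the reference guide:

```shell
# 1. See which files exist under the problem region (placeholder path).
hadoop fs -ls /hbase/mytable/0123456789abcdef0123456789abcdef/cf

# 2. Dump a suspect store file with the HFile tool: -m prints the file's
#    meta block (including key/value counts and max key length), -p prints
#    the key/values themselves, -v is verbose.  A massive or corrupt record
#    should stand out in this output.
hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -p \
  -f hdfs:///hbase/mytable/0123456789abcdef0123456789abcdef/cf/1234567890123456789
```

If the compaction selection narrows things to one file and the tool chokes on (or prints an enormous cell from) a particular record, that record is the likely OOM culprit.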