Re: Cascading failure leads to loss of all region servers
Bryan Beaudreault 2012-04-12, 01:17
Thanks for the reply. Unfortunately, our first instinct was to restart the
region servers and when they came up it appears the compaction was able to
succeed (perhaps because on a fresh restart the heap was low enough to
succeed). I listed the files under that region and there is now only 1.
We are going to be running this job again in the near future. We are going
to try to rate limit the writes a bit (though only 10 reducers were running
at once to begin with), and I will keep in mind your suggestions if it
happens despite that.
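
For reference, the kind of write throttling I have in mind looks roughly like the sketch below. It is only an illustration, not our actual reducer code: the WriteThrottle class and its parameters are made-up names, it is not part of HBase or Hadoop, and the rate you pick is an assumption you would need to tune. Each writer calls acquire() before a put and sleeps if it is ahead of schedule.

```python
import time


class WriteThrottle:
    """Hypothetical helper: caps the rate of writes issued by one writer.

    Each call to acquire() blocks until at least 1/max_per_sec seconds
    have passed since the previous call was allowed through.
    """

    def __init__(self, max_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_per_sec  # minimum spacing between writes
        self.clock = clock                 # injectable for testing
        self.sleep = sleep                 # injectable for testing
        self.next_allowed = clock()        # earliest time the next write may go

    def acquire(self):
        """Wait until the next write is allowed; return seconds waited."""
        now = self.clock()
        wait = self.next_allowed - now
        if wait > 0:
            self.sleep(wait)
            now = self.next_allowed
        else:
            wait = 0.0
        self.next_allowed = now + self.interval
        return wait


if __name__ == "__main__":
    throttle = WriteThrottle(max_per_sec=100)  # assumed cap, tune per job
    for record in range(5):
        throttle.acquire()
        # the actual put to the table would go here
```

Note that this limits each reducer independently, so the cluster-wide rate is roughly the per-reducer cap times the number of reducers running at once.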
On Wed, Apr 11, 2012 at 4:35 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
> <[EMAIL PROTECTED]> wrote:
> > We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
> > hosting about 17k regions.
> That's too many, but that's another story.
> > That pattern repeats on all of the region servers, every 5-8 minutes
> > all are down. Should there be some safeguards on a compaction causing a
> > region server to go OOM? The region appears to only be around 425mb in
> > size.
> My guess is that Region A has a massive or corrupt record in it.
> You could disable the region for now while you are figuring out what's wrong.
> If you list files under this region, what do you see? Are there many?
> Can you see what files are selected for compaction? This will narrow
> the set to look at. You could poke at them w/ the hfile tool. See
> '126.96.36.199.2. HFile Tool' in the reference guide.