-
Cascading failure leads to loss of all region servers
Bryan Beaudreault 2012-04-11, 17:24
We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2, hosting about 17k regions. Each region server has 10GB of heap, and in normal operating levels I have never seen our used heap go above 5-8GB. Yesterday we were running a job to populate a new table, and this resulted in a cascading OOM failure which ended with all region servers being down.
The failure on each node went something like this (region A is the same region across all servers, getting passed along as each dies):

1. RS inherits region A.
2. RS tries to flush region A, but the region has "too many store files". RS delays the flush and instead runs a compaction.
3. A 1 minute pause in the logs (could have been a GC; the logs were otherwise coming in pretty steadily every 1-2 seconds) results in a lost connection to ZK.
4. RS reconnects to ZK and blocks updates on region A, due to memstore too big (129.8m is > 128m blocking size).
5. Another 30 second pause (another GC?).
6. Lost connection to server from master.
7. 1-2 minutes later, RS aborts the compaction and throws OutOfMemoryError: Java heap space. The exception comes from the compaction (pasted below).
That pattern repeats on all of the region servers, every 5-8 minutes until all are down. Should there be some safeguards on a compaction causing a region server to go OOM? The region appears to only be around 425mb in size.

---

(This exception comes from the first regionserver to go down. The others were very similar, with the same stacktrace.)

12/04/10 08:04:57 INFO regionserver.HRegion: aborted compaction on region analytics-search-a2,\x00\x00^\xF4\x00\x0A0,1334056032860.ab55c22574a9cddec8a3e73fd99be57d. after 14mins, 34sec
12/04/10 08:04:57 FATAL regionserver.HRegionServer: ABORTING region server serverName=XXXXXXXX,60020,1326728856867, load=(requests=65, regions=1082, usedHeap=10226, maxHeap=10231): Uncaught exception in service thread regionserver60020.compactor
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:43)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.<init>(BlockDecompressorStream.java:45)
    at com.hadoop.compression.lzo.LzoCodec.createInputStream(LzoCodec.java:173)
    at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:206)
    at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1087)
    at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
    at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1280)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
    at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:943)
    at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:743)
    at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:808)
    at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:748)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
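The load summary in the FATAL line already says how little headroom was left when the compactor thread died. A minimal sketch that pulls the numbers out of that line, assuming (this is not stated in the log itself) that usedHeap and maxHeap are reported in MB, which is consistent with the 10GB heap mentioned above. The helper names are made up for illustration.

```python
import re

# Load summary copied from the FATAL line above.
FATAL_LINE = (
    "ABORTING region server serverName=XXXXXXXX,60020,1326728856867, "
    "load=(requests=65, regions=1082, usedHeap=10226, maxHeap=10231)"
)


def parse_load(line):
    """Return the integer key=value pairs found inside load=(...)."""
    match = re.search(r"load=\(([^)]*)\)", line)
    if match is None:
        raise ValueError("no load=(...) section found")
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", match.group(1))}


def heap_utilization(load):
    """Fraction of the maximum heap in use at the time of the log line."""
    return load["usedHeap"] / load["maxHeap"]


load = parse_load(FATAL_LINE)
# ASSUMPTION: both heap figures are in MB.
free_mb = load["maxHeap"] - load["usedHeap"]
print(f"{heap_utilization(load):.2%} used, {free_mb} MB free, {load['regions']} regions")
```

If the MB assumption holds, the server had only a handful of MB free while hosting 1082 regions, so almost any allocation in the compaction path could have been the one that failed.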
-
Re: Cascading failure leads to loss of all region servers
Stack 2012-04-11, 20:35
On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault <[EMAIL PROTECTED]> wrote:
> We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
> hosting about 17k regions.
That's too many, but that's another story.
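For a rough sense of scale, here is the arithmetic behind the region count. It is a back-of-the-envelope sketch, not a sizing rule: the 0.4 global memstore fraction is an assumption (the stock value of hbase.regionserver.global.memstore.upperLimit as far as I recall), and the thread does not say what this cluster actually uses.

```python
regions_total = 17_000     # "about 17k regions"
region_servers = 16        # 16 m1.xlarge machines
heap_mb = 10 * 1024        # 10GB of heap per region server

# ASSUMPTION: global memstore limit left at a stock fraction of 0.4 of the heap.
global_memstore_fraction = 0.4

regions_per_server = regions_total / region_servers
memstore_budget_mb = heap_mb * global_memstore_fraction
budget_per_region_mb = memstore_budget_mb / regions_per_server

print(f"{regions_per_server:.1f} regions per server")
print(f"{memstore_budget_mb:.0f} MB of memstore shared by all regions on a server")
print(f"{budget_per_region_mb:.2f} MB per region if every region takes writes at once")
```

Under that assumption each region gets only a few MB of memstore before the server starts forcing flushes, which produces many small store files and many compactions.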
> That pattern repeats on all of the region servers, every 5-8 minutes until
> all are down. Should there be some safeguards on a compaction causing a
> region server to go OOM? The region appears to only be around 425mb in
> size.
My guess is that Region A has a massive or corrupt record in it.
You could disable the region for now while you figure out what's wrong with it.
If you list files under this region, what do you see? Are there many?
Can you see what files are selected for compaction? This will narrow the set to look at. You could poke at them with the HFile tool. See '8.7.5.2.2. HFile Tool' in the reference guide.
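To answer the "are there many?" question quickly, the listing can be summarized with a few lines of script. This is only a sketch: it assumes the usual eight-column text output of `hadoop fs -ls` run against the region's column family directory, and the paths and sizes in SAMPLE_LS are invented for illustration, not taken from this cluster.

```python
# HYPOTHETICAL listing; real output would come from `hadoop fs -ls <family dir>`.
SAMPLE_LS = """\
Found 3 items
-rw-r--r--   3 hbase supergroup  104857600 2012-04-10 07:50 /hbase/t/r/f/aaa
-rw-r--r--   3 hbase supergroup  314572800 2012-04-10 07:55 /hbase/t/r/f/bbb
-rw-r--r--   3 hbase supergroup   26214400 2012-04-10 08:01 /hbase/t/r/f/ccc
"""


def parse_listing(ls_output):
    """Return (path, size_in_bytes) for every plain file row in the listing."""
    files = []
    for line in ls_output.splitlines():
        parts = line.split()
        # File rows have 8 columns and start with "-"; this skips the
        # "Found N items" header and directory rows (which start with "d").
        if len(parts) < 8 or not parts[0].startswith("-"):
            continue
        files.append((parts[-1], int(parts[4])))
    return files


files = parse_listing(SAMPLE_LS)
total_bytes = sum(size for _, size in files)
largest_path, largest_size = max(files, key=lambda item: item[1])
print(f"{len(files)} store files, {total_bytes / 2**20:.0f} MB total")
print(f"largest: {largest_path} ({largest_size / 2**20:.0f} MB)")
```

A large file count points at flush and compaction pressure, while one file much bigger than the rest is a good first candidate for the HFile tool.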
St.Ack
-
Re: Cascading failure leads to loss of all region servers
Bryan Beaudreault 2012-04-12, 01:17
Hi Stack,
Thanks for the reply. Unfortunately, our first instinct was to restart the region servers, and when they came back up the compaction appears to have succeeded (perhaps because the heap usage right after a restart was low enough). I listed the files under that region and there is now only 1 file.
We are going to be running this job again in the near future. We are going to try to rate limit the writes a bit (though only 10 reducers were running at once to begin with), and I will keep in mind your suggestions if it happens despite that.
- Bryan
-
Re: Cascading failure leads to loss of all region servers
Andrew Purtell 2012-04-12, 05:58
One idea we took from the 0.89-FB branch is setting the internal scanner read batching for compaction (compactionKVMax) to 1. Batching has no other server-side benefit for compaction, and we sometimes run with heaps at up to 90% utilization for a time, as observed with JMX. I wonder if that would have had an impact here. Just a random thought; pardon me if the default is already 1 (IIRC it's 10) or something silly like that.
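To illustrate why the batch size could matter, here is a toy model, not HBase code: it assumes the compaction reads up to batch_max KeyValues into a list before writing them out, so the bytes held at once are roughly the sum of one batch. The KeyValue sizes are hypothetical (mostly 1KB values with a short run of 50MB ones, in the spirit of the "massive record" guess earlier in the thread).

```python
def batched(sizes, batch_max):
    """Yield consecutive groups of at most batch_max KeyValue sizes."""
    batch = []
    for size in sizes:
        batch.append(size)
        if len(batch) == batch_max:
            yield batch
            batch = []
    if batch:
        yield batch


def peak_batch_bytes(sizes, batch_max):
    """Largest number of bytes buffered by any single batch."""
    return max(sum(batch) for batch in batched(sizes, batch_max))


# HYPOTHETICAL KeyValue sizes in bytes: small values around a run of huge ones.
MB = 2**20
kv_sizes = [1024] * 20 + [50 * MB] * 4 + [1024] * 20

for batch_max in (10, 1):
    peak = peak_batch_bytes(kv_sizes, batch_max)
    print(f"batch_max={batch_max:2d}: peak buffered ~{peak / MB:.1f} MB")
```

In this model a batch of 10 can hold all four large values at once (about 200MB), while a batch of 1 never holds more than one (50MB). Whether that difference would have saved a heap already at its limit is a separate question.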
Best regards,
- Andy