HBase >> mail # user >> Cascading failure leads to loss of all region servers


Cascading failure leads to loss of all region servers
We have 16 m1.xlarge EC2 machines as region servers, running cdh3u2,
hosting about 17k regions.  Each region server has 10GB of heap, and at
normal operating levels I have never seen our used heap go above 5-8GB.
Yesterday we were running a job to populate a new table, and this resulted
in a cascading OOM failure which ended with all region servers being down.

The failure on each node went something like this (region A is the same
region across all servers, getting passed along as each dies):
   1. RS inherits region A.
   2. RS tries to flush region A, but the region has "too many store
   files".  RS delays the flush and runs a compaction instead.
   3. 1 minute pause in the logs (could have been a GC; until then log
   lines were coming in steadily every 1-2 seconds), which results in a
   lost connection to ZK
   4. RS reconnects to ZK and blocks updates on region A, because the
   memstore is too big (129.8m is > 128m blocking size)
   5. Another 30 second pause (another GC?)
   6. Lost connection to server from master
   7. 1-2 minutes later, aborts the compaction and throws OutOfMemoryError:
   Java heap space.  The exception comes from the compaction (pasted below).
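
For reference, the blocking check in step 4 can be sketched as below. This is only an illustration, not the actual region server code: it assumes the 128m blocking size is the product of hbase.hregion.memstore.flush.size and hbase.hregion.memstore.block.multiplier, and the values 64MB and 2 are assumed (only the 128m product appears in the logs). The function names are made up for the example.

```python
MB = 1024 * 1024


def blocking_size_bytes(flush_size_bytes, block_multiplier):
    """Memstore size above which updates to a region are blocked (assumed formula)."""
    return flush_size_bytes * block_multiplier


def updates_blocked(memstore_bytes, flush_size_bytes, block_multiplier):
    """Mirror the log message: block when the memstore is > the blocking size."""
    return memstore_bytes > blocking_size_bytes(flush_size_bytes, block_multiplier)


flush_size = 64 * MB   # assumed hbase.hregion.memstore.flush.size
multiplier = 2         # assumed hbase.hregion.memstore.block.multiplier
memstore = int(129.8 * MB)  # memstore size reported for region A

limit = blocking_size_bytes(flush_size, multiplier)
blocked = updates_blocked(memstore, flush_size, multiplier)
print("blocking size: %dm, updates blocked: %s" % (limit // MB, blocked))
```

With these assumed values the region is only about 1.8m over the limit, so updates stay blocked until the delayed flush can run.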

That pattern repeats on all of the region servers, every 5-8 minutes, until
all are down. Should there be some safeguard against a compaction causing a
region server to go OOM?  The region appears to be only around 425MB in
size.
---
(This exception comes from the first regionserver to go down.  The others
were very similar, with the same stacktrace.)
12/04/10 08:04:57 INFO regionserver.HRegion: aborted compaction on region analytics-search-a2,\x00\x00^\xF4\x00\x0A0,1334056032860.ab55c22574a9cddec8a3e73fd99be57d. after 14mins, 34sec
12/04/10 08:04:57 FATAL regionserver.HRegionServer: ABORTING region server serverName=XXXXXXXX,60020,1326728856867, load=(requests=65, regions=1082, usedHeap=10226, maxHeap=10231): Uncaught exception in service thread regionserver60020.compactor
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:43)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.<init>(BlockDecompressorStream.java:45)
        at com.hadoop.compression.lzo.LzoCodec.createInputStream(LzoCodec.java:173)
        at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.createDecompressionStream(Compression.java:206)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1087)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1280)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:943)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:743)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:808)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:748)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
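
To put the numbers in the abort message in context, here is a small sketch that pulls the load=(...) fields out of that line and compares them with the cluster figures given above. The parse_load helper is hypothetical, written just for this post, and the heap values are presumably in MB (which would match the 10GB heap).

```python
import re

# The serverName/load portion of the FATAL line above, copied verbatim.
ABORT_LINE = (
    "serverName=XXXXXXXX,60020,1326728856867, load=(requests=65, "
    "regions=1082, usedHeap=10226, maxHeap=10231)"
)


def parse_load(line):
    """Return the key=value pairs inside load=(...) as a dict of ints."""
    match = re.search(r"load=\(([^)]*)\)", line)
    if match is None:
        raise ValueError("no load=(...) section found")
    fields = {}
    for pair in match.group(1).split(","):
        key, value = pair.strip().split("=")
        fields[key] = int(value)
    return fields


load = parse_load(ABORT_LINE)
heap_fraction = load["usedHeap"] / load["maxHeap"]
avg_regions = 17000 / 16  # about 17k regions over 16 region servers

print("heap in use at abort: %.2f%%" % (100 * heap_fraction))
print("regions on this server: %d (cluster average about %.0f)"
      % (load["regions"], avg_regions))
```

The heap was essentially full at the time of the abort, while the region count on this server was close to the cluster average.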