HBase, mail # user - RegionServers Crashing every hour in production env


Re: RegionServers Crashing every hour in production env
Pablo Musa 2013-03-08, 18:58
> 0.94 currently doesn't support hadoop 2.0
> Can you deploy hadoop 1.1.1 instead ?

I am using cdh4.2.0, which ships this version as its default installation.
I think it would be a problem for me to deploy 1.1.1, because I would need to
"upgrade" the whole cluster with 70 TB of data (back up everything, go offline, etc.).

Is there a problem with using cdh4.2.0?
Should I send my email to the CDH list instead?

> Are you using 0.94.5 ?

I am using 0.94.2.

> I think it is with your GC config.  What is your heap size?  What is the
> data that you pump in and how much is the block cache size?

#JVM config:
export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m -XX:+UseConcMarkSweepGC -XX:MaxDirectMemorySize=2G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/logs/hbase/gc-hbase.log"

# heap size
export HBASE_HEAPSIZE=8192

#hbase metrics
requestsPerSecond=8, numberOfOnlineRegions=1252, numberOfStores=1272, numberOfStorefiles=1651, storefileIndexSizeMB=66, rootIndexSizeKB=68176, totalStaticIndexSizeKB=55028, totalStaticBloomSizeKB=0, memstoreSizeMB=3, mbInMemoryWithoutWAL=0, numberOfPutsWithoutWAL=0, readRequestsCount=1176287, writeRequestsCount=2165, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=328, maxHeapMB=8185, blockCacheSizeMB=117.94, blockCacheFreeMB=1928.47, blockCacheCount=2083, blockCacheHitCount=34815, blockCacheMissCount=10259, blockCacheEvictedCount=17, blockCacheHitRatio=77%, blockCacheHitCachingRatio=94%, hdfsBlocksLocalityIndex=65, slowHLogAppendCount=0, fsReadLatencyHistogramMean=0, fsReadLatencyHistogramCount=0, fsReadLatencyHistogramMedian=0, fsReadLatencyHistogram75th=0, fsReadLatencyHistogram95th=0, fsReadLatencyHistogram99th=0, fsReadLatencyHistogram999th=0, fsPreadLatencyHistogramMean=0, fsPreadLatencyHistogramCount=0, fsPreadLatencyHistogramMedian=0, fsPreadLatencyHistogram75th=0, fsPreadLatencyHistogram95th=0, fsPreadLatencyHistogram99th=0, fsPreadLatencyHistogram999th=0, fsWriteLatencyHistogramMean=0, fsWriteLatencyHistogramCount=0, fsWriteLatencyHistogramMedian=0, fsWriteLatencyHistogram75th=0, fsWriteLatencyHistogram95th=0, fsWriteLatencyHistogram99th=0, fsWriteLatencyHistogram999th=0

#hbase-site.xml
   <property>
       <name>hbase.hregion.memstore.mslab.enabled</name>
       <value>true</value>
   </property>
   <property>
       <name>hbase.regionserver.handler.count</name>
       <value>20</value>
   </property>

All other parameters are left at their defaults, for both HBase and Hadoop.

There are four tables, all with this same configuration:
{NAME => 'T1', FAMILIES => [{NAME => 'details', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Rows from one table can vary from 4 KB to 50 KB, while rows from the other three
usually vary from 60 bytes to 300 bytes.

> You Full GC'ing around this time?

The GC log shows that one collection took a long time. However, it does not
make sense for GC to be the cause, since the same amount of data was
collected before and AFTER in just 0.01 secs!

[Times: user=0.08 sys=137.62, real=137.62 secs]

Besides, the whole time was spent in system (sys) time. That is what is bugging me.

  ...

1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs] 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]

1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs] 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01, real=0.00 secs]

1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08 sys=137.62, real=137.62 secs]

1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs] 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]

1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs] 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
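
To find other pauses like the 137-second one without reading the whole gc-hbase.log by eye, something like the following could be used. It is only a minimal sketch: the function name `find_sys_dominated_pauses` and the `min_real_secs` threshold are made up for illustration, and the pattern assumes the same `-XX:+PrintGCDetails -XX:+PrintGCTimeStamps` output format shown above.

```python
import re

# One GC entry: a leading timestamp, then everything up to its [Times: ...] block.
# \s+ and DOTALL let the pattern tolerate entries that are wrapped across lines.
GC_ENTRY = re.compile(
    r"(?P<ts>\d+\.\d+): \[GC.*?"
    r"\[Times: user=(?P<user>[\d.]+)\s+sys=(?P<sys>[\d.]+),"
    r"\s+real=(?P<real>[\d.]+) secs\]",
    re.DOTALL,
)


def find_sys_dominated_pauses(log_text, min_real_secs=1.0):
    """Return (timestamp, user, sys, real) for long pauses where sys > user.

    min_real_secs is an arbitrary illustrative threshold, not an HBase
    or JVM setting.
    """
    hits = []
    for m in GC_ENTRY.finditer(log_text):
        user = float(m["user"])
        sys_time = float(m["sys"])
        real = float(m["real"])
        if real >= min_real_secs and sys_time > user:
            hits.append((float(m["ts"]), user, sys_time, real))
    return hits


# Two entries copied from the log in this message.
SAMPLE = """\
1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs]
269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01,
real=0.00 secs]

1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620
secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08
sys=137.62, real=137.62 secs]
"""

if __name__ == "__main__":
    print(find_sys_dominated_pauses(SAMPLE))
    # prints [(1087.834, 0.08, 137.62, 137.62)]
```

Run against the full log (read the file and pass its contents as `log_text`), this would list every collection where the wall-clock time was spent almost entirely in sys rather than user time.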

I really appreciate you guys helping me find out what is wrong.

Thanks,
Pablo
On 03/08/2013 02:11 PM, Stack wrote: