HBase >> mail # user >> RegionServers Crashing every hour in production env


Re: RegionServers Crashing every hour in production env
> 0.94 currently doesn't support hadoop 2.0
> Can you deploy hadoop 1.1.1 instead ?

I am using CDH 4.2.0, which ships this Hadoop version as the default installation.
Deploying 1.1.1 would be a problem for me, because I would need to
"upgrade" the whole cluster with 70 TB of data (back everything up, go offline, etc.).

Is there a problem with using CDH 4.2.0?
Should I send my email to the CDH list instead?

> Are you using 0.94.5 ?

I am using 0.94.2.

> I think it is with your GC config.  What is your heap size?  What is the
> data that you pump in and how much is the block cache size?

#JVM config:
export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m -XX:+UseConcMarkSweepGC -XX:MaxDirectMemorySize=2G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/logs/hbase/gc-hbase.log"

# heap size
export HBASE_HEAPSIZE=8192

#hbase metrics
requestsPerSecond=8, numberOfOnlineRegions=1252, numberOfStores=1272, numberOfStorefiles=1651, storefileIndexSizeMB=66, rootIndexSizeKB=68176, totalStaticIndexSizeKB=55028, totalStaticBloomSizeKB=0, memstoreSizeMB=3, mbInMemoryWithoutWAL=0, numberOfPutsWithoutWAL=0,
readRequestsCount=1176287, writeRequestsCount=2165, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=328, maxHeapMB=8185,
blockCacheSizeMB=117.94, blockCacheFreeMB=1928.47, blockCacheCount=2083, blockCacheHitCount=34815, blockCacheMissCount=10259, blockCacheEvictedCount=17, blockCacheHitRatio=77%, blockCacheHitCachingRatio=94%, hdfsBlocksLocalityIndex=65, slowHLogAppendCount=0,
fsReadLatencyHistogramMean=0, fsReadLatencyHistogramCount=0, fsReadLatencyHistogramMedian=0, fsReadLatencyHistogram75th=0, fsReadLatencyHistogram95th=0, fsReadLatencyHistogram99th=0, fsReadLatencyHistogram999th=0,
fsPreadLatencyHistogramMean=0, fsPreadLatencyHistogramCount=0, fsPreadLatencyHistogramMedian=0, fsPreadLatencyHistogram75th=0, fsPreadLatencyHistogram95th=0, fsPreadLatencyHistogram99th=0, fsPreadLatencyHistogram999th=0,
fsWriteLatencyHistogramMean=0, fsWriteLatencyHistogramCount=0, fsWriteLatencyHistogramMedian=0, fsWriteLatencyHistogram75th=0, fsWriteLatencyHistogram95th=0, fsWriteLatencyHistogram99th=0, fsWriteLatencyHistogram999th=0
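(For anyone double-checking the metrics above: the reported blockCacheHitRatio is just hits / (hits + misses) computed from the two counters in the same dump.)

```python
# Derive blockCacheHitRatio from the regionserver counters above.
hits, misses = 34815, 10259
ratio = hits / (hits + misses)
print(f"blockCacheHitRatio = {ratio:.0%}")  # → 77%, matching the metrics line
```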

#hbase-site.xml
   <property>
       <name>hbase.hregion.memstore.mslab.enabled</name>
       <value>true</value>
   </property>
   <property>
       <name>hbase.regionserver.handler.count</name>
       <value>20</value>
   </property>

All the other parameters, for both HBase and Hadoop, are defaults.

I have four tables, all with this same configuration:
{NAME => 'T1', FAMILIES => [{NAME => 'details', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Rows from one table can vary from 4kb to 50kb while rows from the other 3
usually vary from 60 bytes to 300 bytes.

> You Full GC'ing around this time?

The GC log shows one collection took a long time. However, it does not
make sense for GC itself to be the cause, since the same amount of data
was collected both before AND after in just 0.01 secs!

[Times: user=0.08 sys=137.62, real=137.62 secs]

Besides, almost the whole time was spent in system (kernel) time. That is what is bugging me.
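(An aside from me, not something established in this thread: when real time roughly equals sys time for a ParNew like this, the pause is being spent in the kernel rather than in the JVM, which commonly points at swapping or transparent huge pages. A quick sketch of checks one could run on the regionserver host, assuming a Linux box:)

```shell
# Is the box swapping? Non-zero used swap during the pauses is suspect.
free -m

# How eagerly does the kernel swap anonymous pages? Many HBase setups
# lower this value (e.g. toward 0) on regionserver hosts.
cat /proc/sys/vm/swappiness

# Transparent huge pages: "[always]" has been known to cause long
# kernel-side stalls on some kernels (path may not exist everywhere).
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
```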

  ...

1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs]
275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00,
real=0.01 secs]

1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs]
269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01,
real=0.00 secs]

1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620
secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08
sys=137.62, real=137.62 secs]

1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs]
287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]

1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs]
283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00,
real=0.01 secs]

I really appreciate you guys helping me to find out what is wrong.

Thanks,
Pablo
On 03/08/2013 02:11 PM, Stack wrote: