HBase >> mail # user >> Under Heavy Write Load + Replication On : Brings All My Region Servers Dead


Under Heavy Write Load + Replication On : Brings All My Region Servers Dead
I am running HBase 0.94.2 from Cloudera CDH 4.2 on a 10-machine cluster.

Under heavy write load with replication enabled, all my region servers
go down.
I checked, and the Cloudera version I am using already includes the patch
for HBASE-2611, so I am not sure what is going on. Here is the stack trace:

2013-04-18 01:47:33,423 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Atomically moving relevance-hbase5-snc1.snc1,60020,1366247910200's hlogs to my queue
2013-04-18 01:47:33,424 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: The multi list size is: 1
2013-04-18 01:47:33,425 WARN org.apache.hadoop.hbase.replication.ReplicationZookeeper: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:925)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:901)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:538)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1457)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:705)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:585)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Followed by:

2013-04-18 01:47:36,043 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server relevance-hbase2-snc1.snc1,60020,1366247745434: Writing replication status

With replication turned off, everything runs fine. I can reproduce this
failure almost every time I run my write-heavy job.
Here is the complete log:

http://pastebin.com/da0m475T
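
For anyone going through a log of this size, here is a minimal Python sketch that groups wrapped log lines into entries and lists the WARN/ERROR entries logged shortly before each FATAL abort. It is not part of HBase. The names parse_entries, warnings_before_abort and SAMPLE, and the 10-second window, are assumptions made for illustration, and the timestamp layout is assumed to match the excerpts in this message. SAMPLE is abridged from those excerpts.

```python
import re
from datetime import datetime

# Start of a log entry, assumed to look like the excerpts in this message:
# "2013-04-18 01:47:33,423 INFO ..."
ENTRY_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b\s*(.*)$"
)


def parse_entries(text):
    """Group raw log text into (timestamp, level, message) tuples.

    Lines that do not start with a timestamp (wrapped text, stack frames)
    are appended to the entry they follow.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ENTRY_START.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            entries.append([ts, match.group(2), match.group(3)])
        elif entries:
            entries[-1][2] = (entries[-1][2] + " " + line).strip()
    return [tuple(entry) for entry in entries]


def warnings_before_abort(entries, window_seconds=10):
    """Return (warn_ts, level, message, fatal_ts) for each WARN/ERROR entry
    logged at most window_seconds before a FATAL entry."""
    hits = []
    for i, (ts, level, _msg) in enumerate(entries):
        if level != "FATAL":
            continue
        for prev_ts, prev_level, prev_msg in entries[:i]:
            gap = (ts - prev_ts).total_seconds()
            if prev_level in ("WARN", "ERROR") and 0 <= gap <= window_seconds:
                hits.append((prev_ts, prev_level, prev_msg, ts))
    return hits


# Abridged sample, taken from the excerpts in this message.
SAMPLE = """\
2013-04-18 01:47:33,425 WARN
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Got exception in
copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:901)
2013-04-18 01:47:36,043 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
relevance-hbase2-snc1.snc1,60020,1366247745434: Writing replication status
"""

if __name__ == "__main__":
    for warn_ts, level, msg, fatal_ts in warnings_before_abort(parse_entries(SAMPLE)):
        gap = (fatal_ts - warn_ts).total_seconds()
        print(f"{level} at {warn_ts}, {gap:.3f}s before abort: {msg[:80]}")
```

To run it against a full region server log, read the file into a string and pass it to parse_entries in place of SAMPLE. The window is only a guess and may need widening if the abort follows the warning more slowly on other nodes.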

Any ideas?
Ameya