HBase >> mail # user >> What is the right way to perform a cluster restart?


What is the right way to perform a cluster restart?
Hi Users,

We are running the following versions.

HBase version: 0.90.3 with HBASE-3777, HBASE-2937 and HBASE-3855 applied on top
Hadoop version: CDH3B3

I am trying to figure out the right way to perform a cluster restart
when we want to push a patched jar or a configuration tweak. I have tried
http://wiki.apache.org/hadoop/Hbase/RollingRestart, among
other things (such as reversing the order of process restarts described
in the rolling restart procedure), but we always end up facing the error
shown in the master log at the end of this message.

We cannot use the rolling restart script because ssh is not configured
correctly for the user running HBase. I have tried emulating the steps
in the script by hand. It didn't help.
It's fairly easy to reproduce on the production cluster. I have not been
able to reproduce it on the staging instance, though. Staging holds much
less data, so recovery might take much less time there, but that is
just a guess.
There was nothing interesting in the logs of the RS running at
bond0.ine-46.dummy.net. The way we get past this situation is by
killing the RS causing the contention.
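
To make "the RS causing contention" concrete, here is a minimal sketch of how the lease holder could be picked out of the master log. It assumes the client name always has the shape DFSClient_hb_rs_<host>,<port>,<startcode>_<id>, as it does in the log quoted in this message. The function name find_lease_holders is hypothetical, and other HBase or HDFS versions may format the message differently.

```python
import re

# Shape taken from the AlreadyBeingCreatedException text in this message's log
# (an assumption; not a documented format).
HOLDER_RE = re.compile(
    r"already being created by\s+DFSClient_hb_rs_"
    r"(?P<host>[^,\s]+),(?P<port>\d+),(?P<startcode>\d+)_\d+"
)

def find_lease_holders(log_text):
    """Return unique (host, port, startcode) tuples named as HLog lease holders."""
    holders = []
    for m in HOLDER_RE.finditer(log_text):
        key = (m.group("host"), int(m.group("port")), int(m.group("startcode")))
        if key not in holders:  # keep first-seen order, drop repeats
            holders.append(key)
    return holders

# Excerpt of the WARN entry from the log, including the mail client's line wrap.
sample = """because this file is already being created by
DFSClient_hb_rs_bond0.ine-46.dummy.net,60020,1309310350732_1309310351430
on 172.22.2.46"""

print(find_lease_holders(sample))
```

The \s+ in the pattern tolerates the hard line wraps that mail clients insert, so the same function works on a pasted log as well as on the raw file.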

How are others handling cluster restarts?

2011-06-28 21:25:41,439 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=bond0.ine-52.dummy.net,60020,1309310584702, regionCount=0, userLoad=true
2011-06-28 21:25:41,445 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=bond0.ine-45.dummy.net,60020,1309310343488, regionCount=0, userLoad=true
2011-06-28 21:25:41,471 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=bond0.ine-47.dummy.net,60020,1309309545442, regionCount=0, userLoad=true
2011-06-28 21:25:42,210 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) count to settle; currently=4
2011-06-28 21:25:43,712 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for regionserver count to settle; count=4, sleptFor=4500
2011-06-28 21:25:43,712 INFO org.apache.hadoop.hbase.master.ServerManager: Exiting wait on regionserver(s) to checkin; count=4, stopped=false, count of regions out on cluster=0
2011-06-28 21:25:43,718 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-45.dummy.net,60020,1309310343488 belongs to an existing region server
2011-06-28 21:25:43,719 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732 doesn't belong to a known region server, splitting
2011-06-28 21:25:43,730 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732
2011-06-28 21:25:43,735 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732/bond0.ine-46.dummy.net%3A60020.1309310351513
2011-06-28 21:26:43,976 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 60241ms for lease recovery on hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732/bond0.ine-46.dummy.net%3A60020.1309310351513:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732/bond0.ine-46.dummy.net%3A60020.1309310351513 for DFSClient_hb_m_bond0.ine-54.dummy.net:60000_1309310737997 on client 172.22.2.54, because this file is already being created by DFSClient_hb_rs_bond0.ine-46.dummy.net,60020,1309310350732_1309310351430 on 172.22.2.46
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
at java.security.AccessController.doPrivileged(Native Method)