Re: Datanodes shutdown and HBase's regionservers not working
Nicolas Liochon 2013-02-25, 10:07
I agree.
Then for HDFS, ... The first thing to check is the network I would say.

On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
> Thanks for reply, Nicolas.
>
> My question: What can lead to shutdown of all of the datanodes?
> I believe that the regionservers will be OK if the HDFS is OK.
>
> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
> > Ok, what's your question?
> > When you say the datanode went down, was it the datanode processes or the
> > machines, with both the datanodes and the regionservers?
> >
> > The NameNode pings its datanodes every 3 seconds. However it will internally
> > mark the datanodes as dead after 10:30 minutes (even if in the gui you have
> > 'no answer for x minutes').
> > HBase monitoring is done by ZooKeeper. By default, a regionserver is
> > considered as dead after 180s with no answer. Before, well, it's considered
> > as live.
> > When you stop a regionserver, it tries to flush its data to the disk (i.e.
> > hdfs, i.e. the datanodes). That's why if you have no datanodes, or if a high
> > ratio of your datanodes are dead, it can't shutdown. Connection refused &
> > socket timeouts come from the fact that before the 10:30 minutes hdfs does
> > not declare the nodes as dead, so hbase tries to use them (and, obviously,
> > fails). Note that there is now an intermediate state for hdfs datanodes,
> > called "stale": an intermediary state where the datanode is used only if you
> > have to (i.e. it's the only datanode with a block replica you need). It will
> > be documented in HBase for the 0.96 release. But if all your datanodes are
> > down it won't change much.
> >
> > Cheers,
> >
> > Nicolas
> >
> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
> >>
> >> Hey guys,
> >>
> >> We have a cluster with 5 nodes(1 NN and 4 DNs) running for more than 1
> >> year, and it works fine.
> >> But the datanodes got shutdown twice in the last month.
> >>
> >> When the datanodes got shutdown, all of them became "Dead Nodes" in
> >> the NN web admin UI(http://ip:50070/dfshealth.jsp),
> >> but regionservers of HBase were still live in the HBase web
> >> admin(http://ip:60010/master-status), of course, they were zombies.
> >> All of the processes of jvm were still running, including
> >> hmaster/namenode/regionserver/datanode.
> >>
> >> When the datanodes got shutdown, the load (using the "top" command) of
> >> slaves became very high, more than 10, higher than normal running.
> >> From the "top" command, we saw that the processes of datanode and
> >> regionserver were consuming CPU.
> >>
> >> We could not stop the HBase or Hadoop cluster through normal
> >> commands(stop-*.sh/*-daemon.sh stop *).
> >> So we stopped datanodes and regionservers by kill -9 PID, then the
> >> load of slaves returned to normal level, and we start the cluster
> >> again.
> >>
> >>
> >> Log of NN at the shutdown point(All of the DNs were removed):
> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.152:50010
> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.149:50010
> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.149:50010
> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.150:50010
> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.150:50010
> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.148:50010
> >> 2013-02-22 11:10:03,339 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.148:50010
> >>
> >>
> >> Logs in DNs indicated there were many IOException and
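
The 10:30 minutes quoted in the thread is consistent with how stock HDFS derives its dead-node timeout: as far as I know, the datanodes send a heartbeat to the NameNode every 3 seconds, and the NameNode declares a datanode dead after twice the heartbeat recheck interval plus ten heartbeat intervals. The sketch below only redoes that arithmetic. The default values (3 s and 5 minutes) are assumptions about an unmodified configuration and are not stated in the thread, the function and constant names are made up for illustration, and the property names in the comments vary between Hadoop versions, so check your own hdfs-site.xml.

```python
from datetime import timedelta

# Assumed stock defaults; neither value is given in the thread.
HEARTBEAT_INTERVAL_S = 3             # dfs.heartbeat.interval (seconds)
RECHECK_INTERVAL_MS = 5 * 60 * 1000  # heartbeat recheck interval (milliseconds)


def dead_node_timeout(
    heartbeat_interval_s: int = HEARTBEAT_INTERVAL_S,
    recheck_interval_ms: int = RECHECK_INTERVAL_MS,
) -> timedelta:
    """Return how long a datanode may stay silent before it is declared dead.

    Uses: 2 * recheck interval + 10 * heartbeat interval.
    """
    total_ms = 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s
    return timedelta(milliseconds=total_ms)


if __name__ == "__main__":
    timeout = dead_node_timeout()
    minutes, seconds = divmod(int(timeout.total_seconds()), 60)
    print(f"datanode declared dead after {minutes}:{seconds:02d}")  # 10:30
```

With these assumed defaults a datanode is kept in the live list for 630 seconds, well past the 180 s after which a regionserver is considered dead, which matches the window in which HBase keeps trying to use datanodes that no longer answer.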