MapReduce, mail # user - Re: Datanodes shutdown and HBase's regionservers not working


Re: Datanodes shutdown and HBase's regionservers not working
Nicolas Liochon 2013-02-25, 09:31
Ok, what's your question?
When you say the datanode went down, was it the datanode processes or the
machines, with both the datanodes and the regionservers?

The datanodes send a heartbeat to the NameNode every 3 seconds by default.
However, the NameNode only marks a datanode as dead internally after
10 minutes 30 seconds without a heartbeat (even if the GUI already shows
'no answer for x minutes').
HBase monitoring is done by ZooKeeper. By default, a regionserver is
considered dead after 180s with no answer. Until then, it is still
considered live.
When you stop a regionserver, it tries to flush its data to disk (i.e.
hdfs, i.e. the datanodes). That's why, if you have no datanodes, or if a
high ratio of your datanodes are dead, it can't shut down. The connection
refused errors & socket timeouts come from the fact that, until the
10 minutes 30 seconds have elapsed, hdfs does not declare the nodes dead,
so hbase keeps trying to use them (and, obviously, fails). Note that there
is now an intermediate state for hdfs datanodes, called "stale", in which
the datanode is used only if you have to (i.e. it's the only datanode with
a block replica you need). It will be documented in HBase for the 0.96
release. But if all your datanodes are down it won't change much.
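
As a rough sketch of where the ten-and-a-half-minute delay comes from: as far
as I know, the NameNode computes the expiry as twice the heartbeat recheck
interval plus ten heartbeat intervals. The snippet below just redoes that
arithmetic and classifies a datanode as live, stale or dead. The function
names are hypothetical, and the values are assumed stock defaults (3 s for
dfs.heartbeat.interval, 5 minutes for the recheck interval, 30 s for
dfs.namenode.stale.datanode.interval); check them against your own
hdfs-site.xml.

```python
# Sketch only: mirrors the liveness arithmetic, not actual NameNode code.
# All three values are assumed defaults; your cluster may override them.
HEARTBEAT_INTERVAL_S = 3        # dfs.heartbeat.interval
RECHECK_INTERVAL_S = 5 * 60     # heartbeat recheck interval (5 minutes)
STALE_INTERVAL_S = 30           # dfs.namenode.stale.datanode.interval


def dead_after_seconds(heartbeat_s=HEARTBEAT_INTERVAL_S,
                       recheck_s=RECHECK_INTERVAL_S):
    """Seconds without a heartbeat before a datanode is declared dead."""
    return 2 * recheck_s + 10 * heartbeat_s


def datanode_state(seconds_since_heartbeat,
                   stale_s=STALE_INTERVAL_S,
                   dead_s=None):
    """Classify a datanode (hypothetical helper) as live, stale or dead."""
    if dead_s is None:
        dead_s = dead_after_seconds()
    if seconds_since_heartbeat > dead_s:
        return "dead"
    if seconds_since_heartbeat > stale_s:
        return "stale"
    return "live"


if __name__ == "__main__":
    limit = dead_after_seconds()
    print("dead after %d s (%d min %d s)" % (limit, limit // 60, limit % 60))
    for silence in (10, 60, 700):
        print(silence, "s without heartbeat ->", datanode_state(silence))
```

With these assumed defaults, a datanode that has been silent for a minute is
already "stale" but still counts as usable for another nine minutes or so,
which is the window in which the regionservers keep hitting connection
refused errors and socket timeouts.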

Cheers,

Nicolas

On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <[EMAIL PROTECTED]> wrote:

> Hey guys,
>
> We have a cluster with 5 nodes(1 NN and 4 DNs) running for more than 1
> year, and it works fine.
> But the datanodes got shutdown twice in the last month.
>
> When the datanodes got shutdown, all of them became "Dead Nodes" in
> the NN web admin UI(http://ip:50070/dfshealth.jsp),
> but regionservers of HBase were still live in the HBase web
> admin(http://ip:60010/master-status), of course, they were zombies.
> All of the processes of jvm were still running, including
> hmaster/namenode/regionserver/datanode.
>
> When the datanodes shut down, the load on the slaves (as reported by
> the "top" command) became very high, more than 10, higher than in
> normal operation. From the "top" command, we saw that the datanode and
> regionserver processes were consuming the CPU.
>
> We could not stop the HBase or Hadoop cluster through normal
> commands(stop-*.sh/*-daemon.sh stop *).
> So we stopped the datanodes and regionservers with kill -9 PID, then the
> load on the slaves returned to its normal level, and we started the
> cluster again.
>
>
> Log of NN at the shutdown point(All of the DNs were removed):
> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.152:50010
> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.149:50010
> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.149:50010
> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.150:50010
> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.150:50010
> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.148:50010
> 2013-02-22 11:10:03,339 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.148:50010
>
>
> Logs in DNs indicated there were many IOException and
> SocketTimeoutException:
> 2013-02-22 11:02:52,354 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.148:50010, storageID=DS-970284113-117.25.149.160-50010-1328074119937, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Interrupted receiveBlock
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>         at java.lang.Thread.run(Thread.java:662)