Nicolas Liochon 2013-02-25, 09:31
Davey Yan 2013-02-25, 09:46
Re: Datanodes shutdown and HBase's regionservers not working
I agree.
Then for HDFS, ...
The first thing to check is the network, I would say.
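As background for the timeouts discussed in the quoted messages below: the 10:30-minute figure is not an arbitrary constant but falls out of the NameNode's dead-node formula, 2 × `dfs.namenode.heartbeat.recheck-interval` + 10 × `dfs.heartbeat.interval`. A minimal sketch of the arithmetic, assuming stock HDFS defaults (5-minute recheck interval, 3-second heartbeat):

```python
# HDFS defaults (values are assumptions based on hdfs-default.xml):
heartbeat_interval_s = 3        # dfs.heartbeat.interval, in seconds
recheck_interval_s = 5 * 60     # dfs.namenode.heartbeat.recheck-interval (300000 ms)

# The NameNode declares a datanode dead after
# 2 * recheck interval + 10 * heartbeat interval:
dead_node_timeout_s = 2 * recheck_interval_s + 10 * heartbeat_interval_s
print(dead_node_timeout_s)      # 630 seconds, i.e. 10 minutes 30 seconds

# HBase, by contrast, detects a dead regionserver via its ZooKeeper
# session timeout (zookeeper.session.timeout, 180000 ms by default),
# which is the 180 s mentioned below:
zk_session_timeout_s = 180
```

Lowering either HDFS property shortens the window during which HBase keeps trying to talk to already-dead datanodes, at the cost of more false positives on a flaky network.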
On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan <[EMAIL PROTECTED]> wrote:

> Thanks for reply, Nicolas.
>
> My question: What can lead to shutdown of all of the datanodes?
> I believe that the regionservers will be OK if the HDFS is OK.
>
>
> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <[EMAIL PROTECTED]>
> wrote:
> > Ok, what's your question?
> > When you say the datanode went down, was it the datanode processes or the
> > machines, with both the datanodes and the regionservers?
> >
> > The NameNode pings its datanodes every 3 seconds. However, it will
> > internally mark the datanodes as dead only after 10:30 minutes (even if
> > in the GUI you have 'no answer for x minutes').
> > HBase monitoring is done by ZooKeeper. By default, a regionserver is
> > considered dead after 180s with no answer. Before that, it's considered
> > live.
> > When you stop a regionserver, it tries to flush its data to disk (i.e.
> > HDFS, i.e. the datanodes). That's why, if you have no datanodes, or if
> > a high ratio of your datanodes are dead, it can't shut down. Connection
> > refused & socket timeouts come from the fact that before the 10:30
> > minutes HDFS does not declare the nodes dead, so HBase tries to use
> > them (and, obviously, fails). Note that there is now an intermediate
> > state for HDFS datanodes, called "stale", where the datanode is used
> > only if you have to (i.e. it's the only datanode with a block replica
> > you need). It will be documented in HBase for the 0.96 release. But if
> > all your datanodes are down, it won't change much.
> >
> > Cheers,
> >
> > Nicolas
> >
> >
> >
> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
> >>
> >> Hey guys,
> >>
> >> We have a cluster with 5 nodes (1 NN and 4 DNs) that has been running
> >> for more than 1 year, and it works fine.
> >> But the datanodes got shut down twice in the last month.
> >>
> >> When the datanodes got shut down, all of them became "Dead Nodes" in
> >> the NN web admin UI (http://ip:50070/dfshealth.jsp),
> >> but the regionservers of HBase were still live in the HBase web
> >> admin (http://ip:60010/master-status); of course, they were zombies.
> >> All of the JVM processes were still running, including
> >> hmaster/namenode/regionserver/datanode.
> >>
> >> When the datanodes got shut down, the load (using the "top" command)
> >> of the slaves became very high, more than 10, higher than in normal
> >> operation.
> >> From the "top" command, we saw that the datanode and regionserver
> >> processes were consuming CPU.
> >>
> >> We could not stop the HBase or Hadoop cluster through the normal
> >> commands (stop-*.sh / *-daemon.sh stop *).
> >> So we stopped the datanodes and regionservers with kill -9 PID; then
> >> the load of the slaves returned to the normal level, and we started
> >> the cluster again.
> >>
> >>
> >> Log of NN at the shutdown point (all of the DNs were removed):
> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.152:50010
> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.149:50010
> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.149:50010
> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.150:50010
> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.150:50010
> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.148:50010
> >> 2013-02-22 11:10:03,339 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.148:50010
> >>
> >>
> >> Logs in DNs indicated there were many IOException and