HDFS, mail # user - Re: Datanodes shutdown and HBase's regionservers not working


Re: Datanodes shutdown and HBase's regionservers not working
Davey Yan 2013-02-25, 11:44
But... there were no log entries like "network unreachable".
On Mon, Feb 25, 2013 at 6:07 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
> I agree.
> Then for HDFS, ...
> The first thing to check is the network I would say.
>
>
>
>
> On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>
>> Thanks for the reply, Nicolas.
>>
>> My question: What can cause all of the datanodes to shut down?
>> I believe that the regionservers will be OK if HDFS is OK.
>>
>>
>> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>> > Ok, what's your question?
>> > When you say the datanode went down, was it the datanode processes
>> > or the machines, with both the datanodes and the regionservers?
>> >
>> > Datanodes send a heartbeat to the NameNode every 3 seconds. However,
>> > the NameNode only marks a datanode as dead internally after 10
>> > minutes 30 seconds without a heartbeat (even if the GUI already
>> > shows 'no answer for x minutes').
>> > HBase monitoring is done through ZooKeeper. By default, a
>> > regionserver is considered dead after 180 s with no answer. Until
>> > then, well, it's considered live.
>> > When you stop a regionserver, it tries to flush its data to disk
>> > (i.e. HDFS, i.e. the datanodes). That's why, if you have no
>> > datanodes, or if a high ratio of your datanodes are dead, it can't
>> > shut down. The "connection refused" errors and socket timeouts come
>> > from the fact that HDFS does not declare the nodes dead until the
>> > 10:30 minutes have passed, so HBase keeps trying to use them (and,
>> > obviously, fails). Note that there is now an intermediate state for
>> > HDFS datanodes, called "stale": a stale datanode is used only if you
>> > have to (i.e. it's the only datanode with a block replica you
>> > need). It will be documented in HBase for the 0.96 release. But if
>> > all your datanodes are down, it won't change much.
>> >
>> > Cheers,
>> >
>> > Nicolas
>> >
>> >
>> >
>> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>> >>
>> >> Hey guys,
>> >>
>> >> We have a cluster with 5 nodes (1 NN and 4 DNs) that has been
>> >> running for more than 1 year, and it has worked fine.
>> >> But the datanodes shut down twice in the last month.
>> >>
>> >> When the datanodes shut down, all of them became "Dead Nodes" in
>> >> the NN web admin UI (http://ip:50070/dfshealth.jsp),
>> >> but the HBase regionservers were still shown as live in the HBase
>> >> web admin (http://ip:60010/master-status); of course, they were
>> >> zombies. All of the JVM processes were still running, including
>> >> hmaster/namenode/regionserver/datanode.
>> >>
>> >> When the datanodes shut down, the load on the slaves (as shown by
>> >> the "top" command) became very high, more than 10, well above
>> >> normal. From "top", we saw that the datanode and regionserver
>> >> processes were consuming CPU.
>> >>
>> >> We could not stop the HBase or Hadoop cluster through the normal
>> >> commands (stop-*.sh/*-daemon.sh stop *).
>> >> So we stopped the datanodes and regionservers with kill -9 PID;
>> >> the load on the slaves then returned to its normal level, and we
>> >> started the cluster again.
>> >>
>> >>
>> >> NN log at the shutdown point (all of the DNs were removed):
>> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology:
>> >> Removing a node: /default-rack/192.168.1.152:50010
>> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange:
>> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
>> >> 192.168.1.149:50010
>> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology:
>> >> Removing a node: /default-rack/192.168.1.149:50010
>> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange:
>> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
>> >> 192.168.1.150:50010
>> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology:
>> >> Removing a node: /default-rack/192.168.1.150:50010
>> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange:
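
[Editor's sketch, not part of the original message.] When reading a NameNode log like the excerpt above, it can help to pull out just the "lost heartbeat" and "Removing a node" events and line them up by timestamp. The following is a minimal Python sketch under stated assumptions: the function names (join_wrapped, node_events) are hypothetical, the two message patterns are taken only from the lines quoted in this thread, and the handling of ">" quote prefixes and wrapped lines is there only because the mail client broke each log record across lines.

```python
import re

# A log record starts with a timestamp such as "2013-02-22 11:10:02,278 ".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")
# Email quote markers such as ">> >> " at the start of a line.
QUOTE_PREFIX = re.compile(r"^(?:>+\s*)+")
# The two event messages that appear in the excerpt quoted in this thread.
EVENT = re.compile(r"(Removing a node|lost heartbeat from):?\s+(\S+)")


def join_wrapped(lines):
    """Strip quote markers and glue wrapped continuation lines back together."""
    records = []
    for raw in lines:
        line = QUOTE_PREFIX.sub("", raw).strip()
        if not line:
            continue
        if TIMESTAMP.match(line):
            records.append(line)          # start of a new log record
        elif records:
            records[-1] += " " + line     # continuation of the previous record
    return records


def node_events(lines):
    """Return (timestamp, event, node) tuples for heartbeat-loss and removal events."""
    events = []
    for record in join_wrapped(lines):
        match = EVENT.search(record)
        if match:
            events.append((record[:23], match.group(1), match.group(2)))
    return events


if __name__ == "__main__":
    sample = [
        ">> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology:",
        ">> >> Removing a node: /default-rack/192.168.1.152:50010",
        ">> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange:",
        ">> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from",
        ">> >> 192.168.1.149:50010",
    ]
    for event in node_events(sample):
        print(event)
```

Records that are cut off before the event text (like the last quoted line above) simply produce no event. On an unwrapped log file the same function can be fed the file's lines directly, since lines without quote markers pass through unchanged.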

Davey Yan
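
[Editor's sketch, not part of the original thread.] The "10:30 minutes" figure mentioned in the quoted reply can be reproduced from two settings. As far as I know, the NameNode treats a datanode as dead once no heartbeat has arrived for 2 x the recheck interval plus 10 x the heartbeat interval. The defaults assumed below (a 3 second heartbeat, matching the thread, and a 5 minute recheck interval) and the function names are assumptions for illustration; the property names differ between Hadoop versions, so check the hdfs-default.xml of your release before relying on them.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds without a heartbeat before the NameNode declares a datanode dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    Defaults are assumptions: 3 s heartbeat, 5 min (300000 ms) recheck.
    """
    return (2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s) / 1000


def format_mmss(seconds):
    """Render a duration in seconds as M:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


if __name__ == "__main__":
    timeout = dead_node_timeout_seconds()
    print(timeout, format_mmss(timeout))
```

With the assumed defaults this gives 630 seconds, i.e. 10:30, which is much longer than the 180 s after which a silent regionserver is considered dead according to the quoted reply. That difference is why clients can keep trying unreachable datanodes for several minutes.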