Hadoop >> mail # user >> Datanodes shutdown and HBase's regionservers not working


Davey Yan 2013-02-25, 09:10
Davey Yan 2013-02-26, 01:54
Re: Datanodes shutdown and HBase's regionservers not working
Hi Davey,

So were you able to find the issue?

JM

2013/2/25 Davey Yan <[EMAIL PROTECTED]>:
> Hi Nicolas,
>
> I think I found what led to the shutdown of all of the datanodes, but I am
> not completely certain.
> I will return to this mailing list when my cluster is stable again.
>
> On Mon, Feb 25, 2013 at 8:01 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>> Network error messages are not always friendly, especially if there is a
>> misconfiguration.
>> That said, "connection refused" means that the remote box was reached but
>> nothing was listening on the remote port, i.e. the process was dead.
>> It could be useful to pastebin the whole logs as well...
>>
>>
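A quick way to check the distinction drawn above is to probe the port directly. The sketch below is only an illustration in Python: the `probe` helper is hypothetical, and the host name and port in the usage note are assumptions to be replaced with the values from your own configuration.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt to host:port.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (network, firewall, or a hung host).
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "network is unreachable" or a DNS failure.
        return "error: %s" % exc
```

For example, `probe("datanode-host", 50010)` (assumed host name and datanode port) returning "refused" would suggest the datanode process is not listening, while "timeout" would point toward the network or an overloaded machine.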
>> On Mon, Feb 25, 2013 at 12:44 PM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>>
>>> But... there was no log like "network unreachable".
>>>
>>>
>>> On Mon, Feb 25, 2013 at 6:07 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>>> > I agree.
>>> > Then for HDFS, ...
>>> > The first thing to check is the network I would say.
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> Thanks for the reply, Nicolas.
>>> >>
>>> >> My question: What can lead to the shutdown of all of the datanodes?
>>> >> I believe that the regionservers will be OK if HDFS is OK.
>>> >>
>>> >>
>>> >> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>>> >> > Ok, what's your question?
>>> >> > When you say the datanodes went down, was it the datanode processes or
>>> >> > the machines, with both the datanodes and the regionservers?
>>> >> >
>>> >> > The datanodes send a heartbeat to the NameNode every 3 seconds. However,
>>> >> > the NameNode only marks a datanode as dead internally after 10 minutes
>>> >> > 30 seconds without one (even if the GUI already shows 'no answer for
>>> >> > x minutes').
>>> >> > HBase monitoring is done by ZooKeeper. By default, a regionserver is
>>> >> > considered dead after 180s with no answer. Before that, well, it's
>>> >> > considered live.
>>> >> > When you stop a regionserver, it tries to flush its data to disk (i.e.
>>> >> > HDFS, i.e. the datanodes). That's why, if you have no datanodes, or if
>>> >> > a high ratio of your datanodes are dead, it can't shut down.
>>> >> > Connection refused & socket timeouts come from the fact that HDFS does
>>> >> > not declare the nodes dead before the 10:30 minutes have passed, so
>>> >> > HBase tries to use them (and, obviously, fails).
>>> >> > Note that there is now an intermediate state for HDFS datanodes, called
>>> >> > "stale", in which the datanode is used only if you have to (i.e. it's
>>> >> > the only datanode with a block replica you need). It will be documented
>>> >> > in HBase for the 0.96 release. But if all your datanodes are down it
>>> >> > won't change much.
>>> >> >
>>> >> > Cheers,
>>> >> >
>>> >> > Nicolas
>>> >> >
>>> >> >
>>> >> >
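The timings in the message above can be put side by side with a little arithmetic. The sketch below is an illustration only, written in Python. It assumes the usual HDFS rule that a datanode is declared dead after 2 x the recheck interval + 10 x the heartbeat interval, with defaults of 5 minutes and 3 seconds, which gives the 10:30 figure quoted above. The 30-second stale threshold is also an assumed default, and the function names are hypothetical. Check the values against your own Hadoop and HBase versions.

```python
HEARTBEAT_INTERVAL_S = 3       # datanode heartbeat period (from the thread)
RECHECK_INTERVAL_S = 5 * 60    # assumed NameNode recheck interval default
STALE_INTERVAL_S = 30          # assumed "stale" threshold default
ZK_SESSION_TIMEOUT_S = 180     # regionserver expiry in ZooKeeper (from the thread)


def dead_timeout_s(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Seconds of silence after which the NameNode marks a datanode dead."""
    return 2 * recheck_s + 10 * heartbeat_s


def datanode_state(silence_s):
    """Classify a datanode by the seconds elapsed since its last heartbeat."""
    if silence_s >= dead_timeout_s():
        return "dead"
    if silence_s >= STALE_INTERVAL_S:
        return "stale"   # still listed, but used only as a last resort
    return "live"


def fmt(seconds):
    """Format a number of seconds as m:ss."""
    return "%d:%02d" % divmod(seconds, 60)


if __name__ == "__main__":
    print("datanode declared dead after", fmt(dead_timeout_s()))
    print("regionserver declared dead after", fmt(ZK_SESSION_TIMEOUT_S))
    # Window in which HBase may already be recovering regions while HDFS
    # still hands out datanodes that are no longer answering.
    print("gap:", fmt(dead_timeout_s() - ZK_SESSION_TIMEOUT_S))
```

Under these assumptions there is a window of several minutes in which a silent datanode is not yet dead for HDFS, which is consistent with the connection refused and socket timeout errors described above.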
>>> >> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>> >> >>
>>> >> >> Hey guys,
>>> >> >>
>>> >> >> We have a cluster with 5 nodes (1 NN and 4 DNs) that has been running
>>> >> >> for more than 1 year, and it works fine.
>>> >> >> But the datanodes got shut down twice in the last month.
>>> >> >>
>>> >> >> When the datanodes got shut down, all of them became "Dead Nodes" in
>>> >> >> the NN web admin UI (http://ip:50070/dfshealth.jsp),
>>> >> >> but the HBase regionservers were still live in the HBase web
>>> >> >> admin (http://ip:60010/master-status); of course, they were zombies.
>>> >> >> All of the JVM processes were still running, including
>>> >> >> hmaster/namenode/regionserver/datanode.
>>> >> >>
>>> >> >> When the datanodes got shut down, the load (as shown by the "top"
>>> >> >> command) on the slaves became very high, more than 10, higher than in
>>> >> >> normal running.
>>> >> >> From the "top" command, we saw that the processes of datanode and