Re: Datanodes shutdown and HBase's regionservers not working
Jean-Marc Spaggiari 2013-02-27, 01:58
Hi Davey,

So were you able to find the issue?

JM

2013/2/25 Davey Yan <[EMAIL PROTECTED]>:
> Hi Nicolas,
>
> I think I found what led to the shutdown of all of the datanodes, but I am
> not completely certain.
> I will report back to this mailing list once my cluster is stable again.
>
> On Mon, Feb 25, 2013 at 8:01 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>> Network error messages are not always friendly, especially if there is a
>> misconfiguration.
>> That said, "connection refused" means the remote host was reachable, but
>> nothing was listening on the target port, i.e. the process was dead.
>> It could be useful to pastebin the whole logs as well...
>>
>>
>> On Mon, Feb 25, 2013 at 12:44 PM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>>
>>> But... there was no log like "network unreachable".
>>>
>>>
>>> On Mon, Feb 25, 2013 at 6:07 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>>> > I agree.
>>> > Then for HDFS, ...
>>> > The first thing to check is the network, I would say.
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> Thanks for the reply, Nicolas.
>>> >>
>>> >> My question: what can lead to the shutdown of all of the datanodes?
>>> >> I believe that the regionservers will be OK if HDFS is OK.
>>> >>
>>> >>
>>> >> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>>> >> > Ok, what's your question?
>>> >> > When you say the datanodes went down, was it just the datanode
>>> >> > processes, or the whole machines, with both the datanodes and the
>>> >> > regionservers?
>>> >> >
>>> >> > The datanodes send a heartbeat to the NameNode every 3 seconds.
>>> >> > However, the NameNode only marks a datanode as dead internally after
>>> >> > 10:30 minutes (even if in the GUI you already have 'no answer for x
>>> >> > minutes').
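
If it helps, that 10:30 figure is derived from two HDFS settings: dead-node
timeout = 2 * recheck interval + 10 * heartbeat interval = 2 * 300s + 10 * 3s
= 630s. A minimal hdfs-site.xml sketch with the usual defaults, assuming the
Hadoop 2.x property names (1.x used heartbeat.recheck.interval, so check your
version):

  <!-- hdfs-site.xml: how long the NN waits before declaring a DN dead -->
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>   <!-- seconds between datanode heartbeats -->
  </property>
  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>300000</value>   <!-- ms; 2 * 300s + 10 * 3s = 630s = 10:30 -->
  </property>
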
>>> >> > HBase monitoring is done through ZooKeeper. By default, a
>>> >> > regionserver is considered dead after 180s with no answer; before
>>> >> > that, it is considered live.
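
That 180s is the regionserver's ZooKeeper session timeout. If I remember
correctly it can be tuned in hbase-site.xml (the effective value is also
bounded by the ZooKeeper server's own tick settings), e.g.:

  <!-- hbase-site.xml: RS is declared dead when its ZK session expires -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>   <!-- ms; the 0.9x-era default -->
  </property>
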
>>> >> > When you stop a regionserver, it tries to flush its data to the disk
>>> >> > (i.e. HDFS, i.e. the datanodes). That's why, if you have no datanodes,
>>> >> > or if a high ratio of your datanodes are dead, it can't shut down.
>>> >> > Connection refused & socket timeouts come from the fact that before
>>> >> > the 10:30 minutes HDFS does not declare the nodes dead, so HBase tries
>>> >> > to use them (and, obviously, fails). Note that there is now an
>>> >> > intermediate state for HDFS datanodes, called "stale": the datanode is
>>> >> > used only if you have to (i.e. it's the only datanode with a block
>>> >> > replica you need). It will be documented in HBase for the 0.96
>>> >> > release. But if all your datanodes are down it won't change much.
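
For anyone who wants to try the stale state on a recent enough HDFS, I
believe these are the relevant hdfs-site.xml settings (introduced around
HDFS-3703; off by default at the time of writing, so treat this as a sketch
and verify against your version):

  <!-- hdfs-site.xml: mark a DN stale well before it is declared dead -->
  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>   <!-- deprioritize stale DNs for reads -->
  </property>
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>30000</value>   <!-- ms without a heartbeat before a DN is stale -->
  </property>
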
>>> >> >
>>> >> > Cheers,
>>> >> >
>>> >> > Nicolas
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>> >> >>
>>> >> >> Hey guys,
>>> >> >>
>>> >> >> We have a cluster with 5 nodes (1 NN and 4 DNs) that has been running
>>> >> >> for more than 1 year, and it works fine. But the datanodes got shut
>>> >> >> down twice in the last month.
>>> >> >>
>>> >> >> When the datanodes got shut down, all of them became "Dead Nodes" in
>>> >> >> the NN web admin UI (http://ip:50070/dfshealth.jsp), but the
>>> >> >> regionservers of HBase were still live in the HBase web admin
>>> >> >> (http://ip:60010/master-status); of course, they were zombies. All of
>>> >> >> the JVM processes were still running, including
>>> >> >> hmaster/namenode/regionserver/datanode.
>>> >> >>
>>> >> >> When the datanodes got shut down, the load (from the "top" command)
>>> >> >> on the slaves became very high, more than 10, higher than during
>>> >> >> normal operation. From the "top" command, we saw that the processes
>>> >> >> of datanode and