Re: Datanodes shutdown and HBase's regionservers not working
Jean-Marc Spaggiari 2013-02-27, 01:58
Hi Davey,
So were you able to find the issue?

JM

2013/2/25 Davey Yan <[EMAIL PROTECTED]>:
> Hi Nicolas,
>
> I think i found what led to shutdown of all of the datanodes, but i am
> not completely certain.
> I will return to this mail list when my cluster returns to be stable.
>
> On Mon, Feb 25, 2013 at 8:01 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>> Network error messages are not always friendly, especially if there is a
>> misconfiguration.
>> This said, "connection refused" says that the network connection was made,
>> but that the remote port was not opened on the remote box. I.e. the process
>> was dead.
>> It could be useful to pastebin the whole logs as well...
>>
>> On Mon, Feb 25, 2013 at 12:44 PM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>>
>>> But... there was no log like "network unreachable".
>>>
>>> On Mon, Feb 25, 2013 at 6:07 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>>> > I agree.
>>> > Then for HDFS, ...
>>> > The first thing to check is the network I would say.
>>> >
>>> > On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> Thanks for reply, Nicolas.
>>> >>
>>> >> My question: What can lead to shutdown of all of the datanodes?
>>> >> I believe that the regionservers will be OK if the HDFS is OK.
>>> >>
>>> >> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>>> >> > Ok, what's your question?
>>> >> > When you say the datanode went down, was it the datanode processes
>>> >> > or the machines, with both the datanodes and the regionservers?
>>> >> >
>>> >> > The NameNode pings its datanodes every 3 seconds. However it will
>>> >> > internally mark the datanodes as dead after 10:30 minutes (even if
>>> >> > in the gui you have 'no answer for x minutes').
>>> >> > HBase monitoring is done by ZooKeeper. By default, a regionserver is
>>> >> > considered as dead after 180s with no answer. Before, well, it's
>>> >> > considered as live.
>>> >> > When you stop a regionserver, it tries to flush its data to the disk
>>> >> > (i.e. hdfs, i.e. the datanodes). That's why if you have no datanodes,
>>> >> > or if a high ratio of your datanodes are dead, it can't shutdown.
>>> >> > Connection refused & socket timeouts come from the fact that before
>>> >> > the 10:30 minutes hdfs does not declare the nodes as dead, so hbase
>>> >> > tries to use them (and, obviously, fails). Note that there is now an
>>> >> > intermediate state for hdfs datanodes, called "stale": an
>>> >> > intermediary state where the datanode is used only if you have to
>>> >> > (i.e. it's the only datanode with a block replica you need). It will
>>> >> > be documented in HBase for the 0.96 release. But if all your
>>> >> > datanodes are down it won't change much.
>>> >> >
>>> >> > Cheers,
>>> >> >
>>> >> > Nicolas
>>> >> >
>>> >> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>> >> >>
>>> >> >> Hey guys,
>>> >> >>
>>> >> >> We have a cluster with 5 nodes(1 NN and 4 DNs) running for more
>>> >> >> than 1 year, and it works fine.
>>> >> >> But the datanodes got shutdown twice in the last month.
>>> >> >>
>>> >> >> When the datanodes got shutdown, all of them became "Dead Nodes" in
>>> >> >> the NN web admin UI(http://ip:50070/dfshealth.jsp),
>>> >> >> but regionservers of HBase were still live in the HBase web
>>> >> >> admin(http://ip:60010/master-status), of course, they were zombies.
>>> >> >> All of the processes of jvm were still running, including
>>> >> >> hmaster/namenode/regionserver/datanode.
>>> >> >>
>>> >> >> When the datanodes got shutdown, the load (using the "top" command)
>>> >> >> of slaves became very high, more than 10, higher than normal running.
>>> >> >> From the "top" command, we saw that the processes of datanode and