HDFS >> mail # user >> Re: Datanodes shutdown and HBase's regionservers not working


Re: Datanodes shutdown and HBase's regionservers not working
Yes, we confirmed that inappropriate use of NFS caused the high load
and the lost heartbeats between cluster members.
An NFS partition pointed to a virtual machine for some purpose, but
that virtual machine shut down frequently.
BTW, that NFS partition was not used for the NN metadata backup, only
for another temporary purpose, and it has now been removed.
The NFS partition (with autofs) for the NN metadata backup has no problems.

For more info, google "NFS high load"...
On Wed, Feb 27, 2013 at 9:58 AM, Jean-Marc Spaggiari
<[EMAIL PROTECTED]> wrote:
> Hi Davey,
>
> So were you able to find the issue?
>
> JM
>
> 2013/2/25 Davey Yan <[EMAIL PROTECTED]>:
>> Hi Nicolas,
>>
>> I think I found what led to the shutdown of all of the datanodes, but I am
>> not completely certain.
>> I will return to this mailing list when my cluster is stable again.
>>
>> On Mon, Feb 25, 2013 at 8:01 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>>> Network error messages are not always friendly, especially if there is a
>>> misconfiguration.
>>> That said, "connection refused" means that the remote box was reached,
>>> but that nothing was listening on the remote port, i.e. the process
>>> was dead.
>>> It could be useful to pastebin the whole logs as well...
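
The difference between "connection refused" and a silent or unreachable host, as described above, can be checked from any machine in the cluster. The following is a minimal Python sketch using only the standard library. The `probe` helper is hypothetical and is not part of any Hadoop or HBase tooling.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Classify a single TCP connection attempt to host:port.

    Returns "open" if something accepted the connection, "refused" if the
    host answered but nothing was listening on the port (typically a dead
    process), "timeout" if no answer arrived in time, or "error: <name>"
    for any other socket-level failure (e.g. network unreachable).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Map the errno to its symbolic name when one is available.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
```

For example, `probe("dn1.example.com", 50010)` would test one datanode. The host name is a placeholder, and 50010 is only the usual default data-transfer port in Hadoop 1.x/2.x, so check `dfs.datanode.address` for your cluster. A result of "refused" points at a dead datanode process, while "timeout" or an "error: ..." result points at the network or the machine itself.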
>>>
>>>
>>> On Mon, Feb 25, 2013 at 12:44 PM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>>>
>>>> But... there was no log like "network unreachable".
>>>>
>>>>
>>>> On Mon, Feb 25, 2013 at 6:07 PM, Nicolas Liochon <[EMAIL PROTECTED]>
>>>> wrote:
>>>> > I agree.
>>>> > Then for HDFS, ...
>>>> > The first thing to check is the network I would say.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan <[EMAIL PROTECTED]> wrote:
>>>> >>
>>>> >> Thanks for the reply, Nicolas.
>>>> >>
>>>> >> My question: What can lead to shutdown of all of the datanodes?
>>>> >> I believe that the regionservers will be OK if the HDFS is OK.
>>>> >>
>>>> >>
>>>> >> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <[EMAIL PROTECTED]>
>>>> >> wrote:
>>>> >> > Ok, what's your question?
>>>> >> > When you say the datanode went down, was it the datanode processes or
>>>> >> > the machines, with both the datanodes and the regionservers?
>>>> >> >
>>>> >> > The datanodes send a heartbeat to the NameNode every 3 seconds. However,
>>>> >> > the NameNode only marks a datanode as dead internally after 10 minutes
>>>> >> > 30 seconds (even if the GUI shows 'no answer for x minutes').
>>>> >> > HBase monitoring is done by ZooKeeper. By default, a regionserver is
>>>> >> > considered dead after 180s with no answer. Before that, well, it's
>>>> >> > considered live.
>>>> >> > When you stop a regionserver, it tries to flush its data to the disk
>>>> >> > (i.e. hdfs, i.e. the datanodes). That's why, if you have no datanodes,
>>>> >> > or if a high ratio of your datanodes are dead, it can't shut down.
>>>> >> > Connection refused & socket timeouts come from the fact that hdfs does
>>>> >> > not declare the nodes dead until the 10 minutes 30 seconds have
>>>> >> > elapsed, so hbase tries to use them (and, obviously, fails). Note that
>>>> >> > there is now an intermediate state for hdfs datanodes, called "stale":
>>>> >> > a stale datanode is used only if you have to (i.e. it's the only
>>>> >> > datanode with a block replica you need). It will be documented in
>>>> >> > HBase for the 0.96 release. But if all your datanodes are down it
>>>> >> > won't change much.
>>>> >> >
>>>> >> > Cheers,
>>>> >> >
>>>> >> > Nicolas
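
The 10 minutes 30 seconds mentioned above is not a separate setting. HDFS derives it as twice the heartbeat recheck interval plus ten heartbeat intervals. The sketch below reproduces that arithmetic and the live/stale/dead distinction in Python. The defaults are assumptions based on the stock configuration: 3 s for `dfs.heartbeat.interval`, 300000 ms for `dfs.namenode.heartbeat.recheck-interval` (named `heartbeat.recheck.interval` in older releases), and 30 s for `dfs.namenode.stale.datanode.interval`. The function names are hypothetical and only illustrate the calculation.

```python
def dead_interval_s(heartbeat_interval_s=3.0, recheck_interval_ms=300_000):
    """Seconds of silence after which the NameNode marks a datanode dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_ms / 1000.0 + 10 * heartbeat_interval_s


def datanode_state(silence_s, stale_interval_s=30.0, dead_s=None):
    """Classify a datanode by the time since its last heartbeat.

    stale_interval_s is an assumed default; check your own hdfs-site.xml.
    """
    if dead_s is None:
        dead_s = dead_interval_s()
    if silence_s >= dead_s:
        return "dead"
    if silence_s >= stale_interval_s:
        return "stale"
    return "live"


if __name__ == "__main__":
    minutes, seconds = divmod(int(dead_interval_s()), 60)
    print("dead after %d min %d s" % (minutes, seconds))
    # A datanode silent for 200 s is past the stale threshold but is not
    # yet declared dead, so clients may still try to use it and fail.
    print(datanode_state(200))
```

With these defaults a datanode that has been silent for 200 s is already past the 180 s after which a regionserver would be considered dead, yet HDFS still does not treat it as dead, which matches the window of "connection refused" and socket timeout errors described in the message.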
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <[EMAIL PROTECTED]>
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> Hey guys,
>>>> >> >>
>>>> >> >> We have a cluster with 5 nodes (1 NN and 4 DNs) that has been running
>>>> >> >> for more than 1 year, and it works fine.
>>>> >> >> But the datanodes got shut down twice in the last month.

Davey Yan