-
live/dead node problem - Rita, 2011-03-29, 12:13
Hello All,
Is there a parameter or procedure to check more aggressively for live/dead nodes? Even after I kill the Hadoop process, the node stays listed as active on the "Live Nodes" page for more than 10 minutes. Fortunately, the "last contact" value does keep incrementing.
Using branch-0.21, 0985326.
-- Get your facts first, then you can distort them as you please. --
-
Re: live/dead node problem - Harsh J, 2011-03-29, 12:49
I'm not too sure about it, but I think the "dfs.client.socket-timeout" and "dfs.datanode.socket.write.timeout" keys control the timeout values for reading from and writing to sockets in 0.21 (the defaults are set by HdfsConstants.* values).
--
Harsh J
http://harshj.com
-
RE: live/dead node problem - Michael Segel, 2011-03-29, 15:24
Rita,
When the NameNode doesn't see a heartbeat from a DataNode for about 10 minutes, it recognizes that the node is down.
Per the Hadoop online documentation:
"Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased."
I tried to find an hdfs-site parameter that could be set to shorten this period, but wasn't successful.
HTH
-Mike
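The heartbeat rule in the quoted documentation can be pictured with a small sketch. This is a toy model in Python, not Hadoop code: the class name HeartbeatMonitor, its methods, and the 600-second window are all illustrative assumptions. It only shows why a killed DataNode keeps appearing as live while its "last contact" age grows, until the expiry window has passed.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Toy liveness tracker in the style described above (hypothetical, not Hadoop's class)."""

    def __init__(self, expire_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.expire_seconds = expire_seconds
        self._clock = clock
        self._last_contact: Dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        """Record that `node` has just reported in."""
        self._last_contact[node] = self._clock()

    def last_contact_age(self, node: str) -> float:
        """Seconds elapsed since the node's most recent heartbeat."""
        return self._clock() - self._last_contact[node]

    def dead_nodes(self) -> List[str]:
        """Nodes whose last heartbeat is older than the expiry window."""
        now = self._clock()
        return sorted(n for n, t in self._last_contact.items()
                      if now - t > self.expire_seconds)

    def live_nodes(self) -> List[str]:
        """Nodes that are still inside the expiry window."""
        dead = set(self.dead_nodes())
        return sorted(n for n in self._last_contact if n not in dead)


# Demo with a fake clock so no real waiting is needed.
fake_now = [0.0]
monitor = HeartbeatMonitor(expire_seconds=600, clock=lambda: fake_now[0])
monitor.heartbeat("dn1")
monitor.heartbeat("dn2")

fake_now[0] = 300.0          # dn1 was killed; only dn2 keeps reporting
monitor.heartbeat("dn2")
print(monitor.live_nodes())  # dn1 is still listed as live here

fake_now[0] = 601.0          # the window has now passed for dn1
print(monitor.dead_nodes())
```

In this model the only way to notice a dead node sooner is to shrink the expiry window, which is what the question is asking how to do.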
-
Re: live/dead node problem - Ravi Prakash, 2011-03-29, 21:37
I set these parameters to discover live/dead nodes quickly.
For 0.20: heartbeat.recheck.interval
For 0.22: dfs.namenode.heartbeat.recheck-interval, dfs.heartbeat.interval
Cheers,
Ravi
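To see how these two settings relate to the roughly 10-minute delay, here is a short Python sketch of the arithmetic. It assumes the expiry formula as I recall it from the NameNode source (twice the recheck interval plus ten heartbeat intervals), with the recheck interval in milliseconds (default 5 minutes) and the heartbeat interval in seconds (default 3). The function name is made up for illustration, and the formula should be checked against the source of the version in use.

```python
def heartbeat_expire_ms(recheck_interval_ms: int = 5 * 60 * 1000,
                        heartbeat_interval_s: int = 3) -> int:
    """Time after which a silent DataNode would be declared dead, in ms.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


default_ms = heartbeat_expire_ms()
print(default_ms, default_ms / 60000)   # 630000 ms, i.e. 10.5 minutes

aggressive_ms = heartbeat_expire_ms(recheck_interval_ms=1,
                                    heartbeat_interval_s=1)
print(aggressive_ms)                    # 10002 ms, i.e. about 10 seconds
```

Under these assumptions the defaults give 10.5 minutes, which matches the "10+ minutes" reported at the start of the thread.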
-
Re: live/dead node problem - Rita, 2011-03-30, 00:13
What about 0.21?
Also, where do you set this: in the DataNode configuration or the NameNode's? It seems the default is set to "3 seconds".
-
Re: live/dead node problem - Ravi Prakash, 2011-03-30, 15:52
I haven't used 0.21. You can compare the source code of the two versions.
I set these to 1 in the NameNode's hdfs-site.xml. I'm not sure you'd want to do that on a production cluster if it's a big one.
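For reference, the hdfs-site.xml entries described above could look like the fragment that this Python sketch prints. The property names are the 0.22 names given earlier in the thread; the helper function is hypothetical, and the units (milliseconds for the recheck interval, seconds for the heartbeat interval) are my assumption rather than something stated in the thread. Treat it as a test-cluster illustration only.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def hdfs_site_fragment(properties: Dict[str, object]) -> str:
    """Build a <configuration> fragment in the usual Hadoop property layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values of 1, as mentioned in the reply; likely too aggressive for a large cluster.
fragment = hdfs_site_fragment({
    "dfs.namenode.heartbeat.recheck-interval": 1,  # assumed unit: milliseconds
    "dfs.heartbeat.interval": 1,                   # assumed unit: seconds
})
print(fragment)
```

On 0.20 the first property would be heartbeat.recheck.interval instead, per the earlier reply.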