Re: Question about dead datanode
Looks like I patched it in DFSClient.java, here is the patch:
https://gist.github.com/anonymous/9028934

So, the issue was this:

DFSInputStream is the class that is started as a thread, and it
maintains a 'deadNodes' list of datanodes that had problems (in our
case, a datanode lost power and was down).  Since each thread running
DFSInputStream had its own deadNodes instance, which started out
empty, there were _tons_ of errors (over a period of 4 days!).
My changes are simple.

I moved the 'deadNodes' list out into a global field that is
accessible by all running threads, so whenever a datanode goes down,
each thread is effectively informed that the datanode _is_ down.
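
To illustrate the idea only (this is not the code from the patch), here
is a minimal sketch of a dead-node set shared by every reader. The
class name ReaderSketch and its methods are hypothetical stand-ins, not
part of DFSClient:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a reader stream; NOT the real DFSInputStream.
class ReaderSketch {
    // A single thread-safe set for the whole JVM. Because the field is
    // static, every reader instance sees the same dead-node entries.
    private static final Set<String> DEAD_NODES = ConcurrentHashMap.newKeySet();

    // Called by whichever reader first fails to connect to a datanode.
    void reportFailure(String datanode) {
        DEAD_NODES.add(datanode);
    }

    // Any other reader can check the shared set before trying the node.
    boolean isDead(String datanode) {
        return DEAD_NODES.contains(datanode);
    }
}
```

With a per-instance list, a second reader would start with an empty
list and hit the same connection error again. With the shared set, a
failure reported by one reader is visible to all the others right away.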

I did not want to mess with the caching of locatedBlocks, so I added a
dampening counter that keeps track of how many times DFSClient tries
to access a 'bad/dead' datanode.  I arbitrarily chose the value '10'.
After 10 attempts the DFSClient resumes trying to contact the
datanode, by which time it is hopefully back up.
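
The dampening counter could be sketched roughly as follows. This is an
illustration under assumptions, not the patch itself: the class and
method names (DampenedDeadNodes, markDead, shouldSkip) are made up, and
the exact comparison against 10 in the real patch may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper; names and the exact threshold check are assumptions.
class DampenedDeadNodes {
    // The arbitrary value mentioned in the post.
    static final int RETRY_THRESHOLD = 10;

    // datanode address -> how many times it has been skipped while dead
    private final Map<String, Integer> skipCounts = new ConcurrentHashMap<>();

    // Put a datanode on the dead list with a fresh counter.
    void markDead(String datanode) {
        skipCounts.putIfAbsent(datanode, 0);
    }

    // Returns true if the caller should skip this datanode for now.
    // Once the node has been skipped RETRY_THRESHOLD times, it is taken
    // off the dead list and the caller is told to try it again.
    boolean shouldSkip(String datanode) {
        boolean[] skip = {false};
        skipCounts.computeIfPresent(datanode, (node, count) -> {
            if (count >= RETRY_THRESHOLD) {
                return null; // returning null removes the entry
            }
            skip[0] = true;
            return count + 1;
        });
        return skip[0];
    }
}
```

In this sketch a single DampenedDeadNodes instance would be shared by
all readers. If the datanode is still down when it is retried, the
caller would simply call markDead again and the count starts over.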

In summary, all threads are informed of bad datanodes, so there are no
attempts to contact one unless its counter <datanode, count> is
greater than 10.  A better solution would also invalidate the
locatedBlocks cache, but this already seems like a huge improvement.

Here is the log of my testing in our live cluster:

at 19:34:42, I kill the datanode, and it's put on the deadNodes list;
at 19:47:05, it's back up, and the counter is > 10, so it's used again.

2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Failed
to connect to /10.101.5.5:50010 for file
/hbase/img863/36b17cc018e4b8494ef700523628054a/att/7640828832753135438
for block -4025527892682081728: Will add to deadNodes:
java.net.ConnectException: Connection refused
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Adding
server to deadNodes, maybe? 10.101.5.5:50010
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Inside
addToDeadNodes Print All DeadNodes:: 10.101.5.5:50010
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:35:49,881 WARN org.apache.hadoop.hdfs.DFSClient: Inside
addToDeadNodes Print All DeadNodes:: 10.101.5.5:50010
2014-02-15 19:36:32,547 WARN org.apache.hadoop.hdfs.DFSClient: Remove
Node from deadNodes:: 10.103.2.5:50010 at counter
{10.103.2.5:50010=10, 10.101.5.5:50010=1}
2014-02-15 19:39:23,662 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,878 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,944 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,962 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,979 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,667 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,708 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,718 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,933 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient: Remove
Node from deadNodes:: 10.101.5.5:50010 at counter {10.103.2.5:50010=0,
10.101.5.5:50010=10}
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient: Found
bestNode:: 10.101.5.5:50010
2014-02-15 19:47:05,686 INFO org.apache.hadoop.hdfs.DFSClient:
Datanode available for block: 10.101.5.5:50010
-Jack

On Fri, Feb 14, 2014 at 10:16 AM, Jack Levin <[EMAIL PROTECTED]> wrote: