region servers failing due to bad datanode
prem yadav 2012-08-20, 13:23
Hi,
we have been facing some datanode-related issues lately that keep causing
our region servers to fail.
Our cluster setup is as follows:

Versions:
Hadoop - 1.0.1
HBase - 0.94.1

All the machines run datanodes, tasktrackers, and regionservers, and
occasionally map-reduce jobs. They are all EC2 m1.large instances with
7.5 GB of memory each, and the region servers are assigned 4 GB of heap.
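
(For reference, that heap size is presumably set along these lines in
hbase-env.sh; the exact line below is our assumption, not copied from the
cluster:)

    # hbase-env.sh -- HBase daemon heap in MB (assumed value, not verified)
    export HBASE_HEAPSIZE=4096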
It looks like, for some reason, a datanode fails to respond to the region
server's request for a block and a timeout exception occurs, which causes
the region server to fail.
In some cases, we have also seen the datanode commit the block under a new
generation stamp (the block ID stays the same; only the trailing suffix
changes). This is evident from the logs:
"oldblock=blk_-7841650651979512601_775949(length=32204106),
newblock=blk_-7841650651979512601_775977(length=32204106),
datanode=<ip>:50010"
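
(A quick sketch of how we are reading those names: an HDFS block name has
the form blk_<blockId>_<generationStamp>, so splitting the two strings from
the log shows the block ID is unchanged and only the stamp moved from
775949 to 775977. The class below is illustrative only, not part of any
Hadoop API:)

    // Illustrative only: compare the two block names quoted above.
    public class BlockNameCheck {
        public static void main(String[] args) {
            String oldBlock = "blk_-7841650651979512601_775949";
            String newBlock = "blk_-7841650651979512601_775977";
            // Drop the "blk_" prefix, then split into <blockId> and <generationStamp>.
            String[] oldParts = oldBlock.substring(4).split("_");
            String[] newParts = newBlock.substring(4).split("_");
            System.out.println("same block id: " + oldParts[0].equals(newParts[0])); // true
            System.out.println("genstamp: " + oldParts[1] + " -> " + newParts[1]);   // 775949 -> 775977
        }
    }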

In that case, the region server keeps asking for the old generation stamp
and gets an error along the lines of:

" org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
blk_8680479961374491733_745849 failed  because recovery from primary
datanode <ip-address>:50010 failed 6 times"
The logs we get on the region server are:

2012-08-20 00:03:28,821 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-7841650651979512601_775949 in pipeline <ip>:50010,
<ip>:50010, <ip>:50010: bad datanode <datanode_ip>:50010

2012-08-20 00:03:28,758 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-7841650651979512601_775949 bad datanode[0]
<datanode_ip>:50010
or something like the following:

 org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception for block blk_-7841650651979512601_775949
java.net.SocketTimeoutException: 69000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/<local_ip>:37227
remote=/<local_ip>:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readLong(DataInputStream.java:416)
at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2967)
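
(As an aside, the 69000 ms figure appears to match dfs.socket.timeout,
which defaults to 60000 ms in Hadoop 1.x, plus the client's 3000 ms
per-datanode extension for a three-node pipeline: 60000 + 3 * 3000 = 69000.
If the datanodes are slow under load rather than dead, one commonly
suggested mitigation is raising that timeout in hdfs-site.xml; the value
below is an illustrative assumption, not a confirmed fix:)

    <!-- hdfs-site.xml: illustrative value, not a verified fix -->
    <property>
      <name>dfs.socket.timeout</name>
      <value>120000</value> <!-- HDFS read/ack timeout in ms; default 60000 -->
    </property>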

The namenode logs:

2012-08-20 00:03:29,446 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_-7841650651979512601_775949,
newgenerationstamp=775977, newlength=32204106, newtargets=[<ip-address of
datanodes>], closeFile=false, deleteBlock=false)

2012-08-19 23:59:18,995 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.allocateBlock:
/hbase/.logs/<regionserver>,60020,1345222869339/<region-server>%2C60020%2C1345222869339.1345420758726.
blk_-7841650651979512601_775949
Datanode logs:

2012-08-19 23:59:18,999 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-7841650651979512601_775949 src: /<ip>:42937 dest: /<ip>:50010
2012-08-20 00:03:28,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
for block blk_-7841650651979512601_775949 java.io.EOFException: while
trying to read 65557 bytes
2012-08-20 00:03:28,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
blk_-7841650651979512601_775949 0 : Thread is interrupted.
2012-08-20 00:03:28,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
block blk_-7841650651979512601_775949 terminating
2012-08-20 00:03:28,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-7841650651979512601_775949 received exception java.io.EOFException:
while trying to read 65557 bytes
2012-08-20 00:03:29,264 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_-7841650651979512601_775949, targets=[<ip>:50010,
<ip>:50010])
2012-08-20 00:03:29,440 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
oldblock=blk_-7841650651979512601_775949(length=32204106),
newblock=blk_-7841650651979512601_775977(length=32204106),
datanode=<ip>:50010
We have seen multiple posts about this problem but could not find a
solution. Since the failing block belongs to a region server's WAL (the
/hbase/.logs path in the namenode log above), a write pipeline that cannot
be recovered takes the WAL down with it; we thought the region servers
should be able to ride over such failures, but apparently they cannot.
How do we resolve this? Is there some tuning we need to do on the
datanodes?
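
(In case it is relevant to anyone answering: one datanode setting that
HBase clusters routinely have to raise is the transceiver limit, which
defaults to 256 and, when exhausted, can produce pipeline failures much
like these. A hedged hdfs-site.xml sketch, assuming the cluster is still
on the default; note the historical misspelling of the property name:)

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value> <!-- default 256 is widely considered too low for HBase -->
    </property>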