region servers failing due to bad datanode
prem yadav 2012-08-20, 13:23
Hi,
We have been facing some datanode-related issues lately that keep causing our region servers to fail. Our cluster setup is as follows:

Versions:
    Hadoop 1.0.1
    HBase 0.94.1

All the machines run datanodes, tasktrackers, and region servers, plus MapReduce jobs (rarely). They are all EC2 m1.large machines with 7.5 GB of memory each. The region servers are assigned 4 GB of memory.

It looks like, for some reason, the datanode fails to respond to the region server's query for a block and a timeout exception occurs. This causes the region server to fail.

In some cases, we have also seen the datanode commit the block under a different block name. This is evident from the logs:

    "oldblock=blk_-7841650651979512601_775949(length=32204106), newblock=blk_-7841650651979512601_775977(length=32204106), datanode=<ip>:50010"

In this case, the region server keeps querying for the old block name and gets an error along the lines of:

    " org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8680479961374491733_745849 failed because recovery from primary datanode <ip-address>:50010 failed 6 times"

The logs we get on the region server are:

    2012-08-20 00:03:28,821 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7841650651979512601_775949 in pipeline <ip>:50010, <ip>:50010, <ip>:50010: bad datanode <datanode_ip>:50010
    2012-08-20 00:03:28,758 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7841650651979512601_775949 bad datanode[0] <datanode_ip>:50010

or something like the following:

    org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-7841650651979512601_775949java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<local_ip>:37227 remote=/<local_ip>:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at java.io.DataInputStream.readLong(DataInputStream.java:416)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2967)

The namenode logs:

    2012-08-20 00:03:29,446 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_-7841650651979512601_775949, newgenerationstamp=775977, newlength=32204106, newtargets=[<ip-address of datanodes>], closeFile=false, deleteBlock=false)
    2012-08-19:2012-08-19 23:59:18,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.logs/<regionserver>,60020,1345222869339/<region-server>%2C60020%2C1345222869339.1345420758726. blk_-7841650651979512601_775949

Datanode logs:

    2012-08-19:2012-08-19 23:59:18,999 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-7841650651979512601_775949 src: /<ip>:42937 dest: /<ip>:50010
    2012-08-20 00:03:28,831 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-7841650651979512601_775949 java.io.EOFException: while trying to read 65557 bytes
    2012-08-20 00:03:28,831 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-7841650651979512601_775949 0 : Thread is interrupted.
    2012-08-20 00:03:28,831 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-7841650651979512601_775949 terminating
    2012-08-20 00:03:28,831 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-7841650651979512601_775949 received exception java.io.EOFException: while trying to read 65557 bytes
    2012-08-20 00:03:29,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls recoverBlock(block=blk_-7841650651979512601_775949, targets=[<ip>:50010, <ip>:50010])
    2012-08-20 00:03:29,440 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: oldblock=blk_-7841650651979512601_775949(length=32204106), newblock=blk_-7841650651979512601_775977(length=32204106), datanode=<ip>:50010

We have seen multiple posts about this problem but could not find a solution. We thought the region servers should be able to handle these errors, but it looks like they cannot. How do we resolve this? Is there some tuning we need to do for the datanodes?
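
A note on the "different block name" in the datanode log: the names appear to follow the pattern blk_<block id>_<generation stamp>, and the old and new names share the same block id and length, differing only in the trailing number. The new trailing number (775977) matches the newgenerationstamp value in the namenode's commitBlockSynchronization line, so this looks like the same block with a bumped generation stamp after recovery rather than a renamed block. The sketch below is only an illustration of that comparison; the regular expression and the helper name parse_block are assumptions made for this example, not part of any Hadoop tooling.

```python
import re

# Assumed pattern for block names as they appear in the quoted logs:
# blk_<block id>_<generation stamp>, where the block id may be negative.
BLOCK_RE = re.compile(r"blk_(-?\d+)_(\d+)")


def parse_block(name):
    """Hypothetical helper: split a block name into (block_id, generation_stamp)."""
    match = BLOCK_RE.search(name)
    if match is None:
        raise ValueError(f"no block name found in {name!r}")
    return int(match.group(1)), int(match.group(2))


# Datanode log fragment quoted in the post (the <ip> placeholder is kept as-is).
line = (
    "oldblock=blk_-7841650651979512601_775949(length=32204106), "
    "newblock=blk_-7841650651979512601_775977(length=32204106), "
    "datanode=<ip>:50010"
)

# findall returns one (id, stamp) pair of strings per block name in the line.
old_block, new_block = [
    (int(block_id), int(stamp)) for block_id, stamp in BLOCK_RE.findall(line)
]

same_id = old_block[0] == new_block[0]
stamp_changed = old_block[1] != new_block[1]

print("same block id:", same_id)
print("generation stamp:", old_block[1], "->", new_block[1])
```

Running the same comparison over the region server and namenode lines would show whether the client is still asking for the block under its old generation stamp.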