HBase >> mail # user >> Terribly long HDFS timeouts while appending to HLog


Varun Sharma 2012-11-07, 09:43
Re: Terribly long HDFS timeouts while appending to HLog
Hi Varun,

HDFS-3703 and HDFS-3912 are about this.
The story is not over yet (and there is related work such as HDFS-3704,
HDFS-3705 and HDFS-3706), but it helps by lowering the probability of
going to a dead datanode: by default HDFS waits 10 minutes before
deciding a datanode is dead, while with the jiras mentioned above a
non-responding datanode is, after 30s (configurable), no longer used for
writes and given the lowest priority for reads.

Cheers,

Nicolas
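
For reference, a minimal hdfs-site.xml sketch of the stale-datanode settings
this work introduces; the property names below are as they appear in releases
that include HDFS-3703/HDFS-3912 (older branch-1 backports used slightly
different keys), so verify them against your version's hdfs-default.xml:

<!-- hdfs-site.xml (NameNode side): mark a datanode stale after 30s without
     heartbeats, then avoid it for writes and order it last for reads -->
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>  <!-- milliseconds; the "30s (configurable)" above -->
</property>
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>   <!-- stale nodes are sorted last for reads -->
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>   <!-- stale nodes are left out of new write pipelines -->
</property>
<property>
  <name>dfs.namenode.write.stale.datanode.ratio</name>
  <value>0.5f</value>   <!-- stop avoiding writes if too many nodes look stale -->
</property>

Note that this only shortens how quickly a misbehaving datanode is avoided for
new pipelines; a pipeline that is already established still has to hit its own
socket timeout, which is what the logs below show.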

On Wed, Nov 7, 2012 at 10:43 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am seeing extremely long HDFS timeouts - and this seems to be associated
> with the loss of a DataNode. Here is the RS log:
>
> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
> exception  for block blk_2813460962462751946_78454java.io.IOException: Bad
> response 1 for block blk_2813460962462751946_78454 from datanode
> 10.31.190.107:9200
>         at
>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
>
> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
> 12/11/07 02:17:45 WARN wal.HLog: IPC Server handler 35 on 60020 took 65955
> ms appending an edit to hlog; editcount=476686, len~=76.0
> 12/11/07 02:17:45 WARN wal.HLog: HDFS pipeline error detected. Found 2
> replicas but expecting no less than 3 replicas.  Requesting close of hlog.
>
> The corresponding DN log goes like this:
>
> 2012-11-07 02:17:45,142 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder 2 for
> Block blk_2813460962462751946_78454): PacketResponder
> blk_2813460962462751946_78454 2 Exception java.net.SocketTimeoutException:
> 66000 millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
> remote=/
> 10.31.190.107:9200]
>         at
>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at
>
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>         at
>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:806)
>         at java.lang.Thread.run(Thread.java:662)
>
> It seems like the DataNode local to the region server is stuck waiting on
> another DN in the pipeline, and that wait times out because the other
> datanode is bad. All in all, this makes response times terribly poor. Is
> there a way around this, or am I missing something?
>
> Varun
>
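
The 66000 millis timeout in the DataNode log is the pipeline socket read
timeout: roughly dfs.socket.timeout (60s by default) plus a few seconds of
per-downstream-node extension, so a single unresponsive datanode can stall an
HLog append for about a minute before pipeline recovery kicks in. A hedged
sketch of the knobs involved (hdfs-site.xml; the values are illustrative, and
lowering them trades faster failure detection for more spurious pipeline
recoveries):

<!-- hdfs-site.xml (client and datanode side): how long a writer waits on a
     silent pipeline before declaring the downstream datanode bad -->
<property>
  <name>dfs.socket.timeout</name>
  <value>60000</value>   <!-- read/ack timeout in ms; behind the ~66s stall -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>480000</value>  <!-- write-side timeout in ms (default 8 minutes) -->
</property>

The "Found 2 replicas but expecting no less than 3 replicas" roll request is
HBase's own low-replication check on the HLog writer (its threshold is
hbase.regionserver.hlog.tolerable.lowreplication, where that setting exists);
relaxing it only hides the underlying pipeline problem, so the stale-datanode
settings discussed above are the better lever.
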
Jeremy Carroll 2012-11-07, 15:22
Jeremy Carroll 2012-11-07, 15:25
Varun Sharma 2012-11-07, 17:57
David Charle 2012-11-07, 18:21
Jeremy Carroll 2012-11-07, 19:52
Jeremy Carroll 2012-11-07, 19:53
Varun Sharma 2012-11-07, 21:52