HBase user mailing list - Terribly long HDFS timeouts while appending to HLog


Messages in this thread:
Varun Sharma 2012-11-07, 09:43
Nicolas Liochon 2012-11-07, 09:56
Jeremy Carroll 2012-11-07, 15:22
Jeremy Carroll 2012-11-07, 15:25
Varun Sharma 2012-11-07, 17:57
David Charle 2012-11-07, 18:21
Jeremy Carroll 2012-11-07, 19:52
Jeremy Carroll 2012-11-07, 19:53
Re: Terribly long HDFS timeouts while appending to HLog
Great - thanks for all the information.

Reducing the timeout seems like one plausible approach to me. However, I
had two follow-up questions:
1) For writes - since these need to happen synchronously - if a DataNode
goes down, we would essentially have a slight inconsistency across the 3
replicas, since that node will not have the latest walEdit (say for a
walAppend). In such a case, how does HDFS handle this? What if the
DataNode is brought back up before it has expired on the NameNode?
2) Does the NameNode suggest DataNodes that are local to the RegionServer
for a read operation?

Also, I am running the DataNode with Xmx=1024M - could that be a reason
behind the DataNode crashes?

Thanks
Varun
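
A note on the timeout Varun proposes reducing: in Hadoop 1.x the usual knobs
are dfs.socket.timeout (read side, default 60000 ms) and
dfs.datanode.socket.write.timeout (write side, default 480000 ms). Below is a
minimal sketch of lowering them from client code; the class name and the 20 s
values are illustrative only and not taken from this thread, and the same keys
can equally be set in hdfs-site.xml on the RegionServer and DataNode hosts.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch only: shorter HDFS socket timeouts so a dead DataNode in the
// write pipeline is given up on sooner. The 20 s values are illustrative.
public class LowerHdfsTimeouts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.socket.timeout", 20000);                // read timeout, default 60000 ms
    conf.setInt("dfs.datanode.socket.write.timeout", 20000); // pipeline write timeout, default 480000 ms
    FileSystem fs = FileSystem.get(conf);                    // picks up core-site.xml / hdfs-site.xml on the classpath
    System.out.println("Using filesystem " + fs.getUri());
    fs.close();
  }
}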

On Wed, Nov 7, 2012 at 11:53 AM, Jeremy Carroll <[EMAIL PROTECTED]> wrote:

> Er. Inconsistent. Sorry for typo. Basically when the underlying file system
> is unstable, HBase can become unstable as well.
>
> On Wed, Nov 7, 2012 at 11:52 AM, Jeremy Carroll <[EMAIL PROTECTED]> wrote:
>
> > It's important to realize that HBase is a strongly consistent system. So
> > if a DataNode is down (but looks alive because HDFS has not marked it as
> > down), the system will choose to be unavailable rather than consistent.
> > During this timeframe, when the underlying HDFS file system was not
> > operating normally (did not mark nodes as failed), HBase can give up /
> > time out on a lot of operations. HDFS replicates data with a default
> > factor of 3, so during this time the node may have been a replication
> > target, but unable to satisfy that request.
> >
> >
> > On Wed, Nov 7, 2012 at 9:57 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> >
> >> Thanks for the response. One more point is that I am running hadoop 1.0.4
> >> with hbase 0.92 - not sure if that is known to have these issues.
> >>
> >> I had one quick question though - these logs are picked from 10.31.138.145
> >> and, from my understanding of the logs below, it's still going to another
> >> bad datanode for retrieving the block even though it should already have
> >> the data block - see last line...
> >>
> >> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
> >> exception for block blk_2813460962462751946_78454 java.io.IOException: Bad
> >> response 1 for block blk_2813460962462751946_78454 from datanode
> >> 10.31.190.107:9200
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
> >>
> >> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> >> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
> >> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> >> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
> >> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
> >>
> >> Looking at the DataNode logs, it seems that the local datanode is trying
> >> to connect to the remote bad datanode. Is this for replicating the
> >> WALEdit?
> >>
> >> 2012-11-07 02:17:45,142 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> >> (PacketResponder 2 for Block blk_2813460962462751946_78454):
> >> PacketResponder blk_2813460962462751946_78454 2 Exception
> >> java.net.SocketTimeoutException: 66000 millis timeout while waiting for
> >> channel to be ready for read. ch :
> >> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
> >> remote=/10.31.190.107:9200]
> >> Also, this is preceded by a whole bunch of slow operations with
> >> processingtimems close to 20 seconds, like these - are these other slow
> >> walEdit appends (slowed down due to HDFS)?
> >>
> >> 12/11/07 02:16:01 WARN ipc.HBaseServer: (responseTooSlow):
> >> {"processingtimems":21957,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@7198c05d),
> >> rpc version=1, client version=29, methodsFingerPrint=54742778",
> >> "client":"10.31.128.131:55327","starttimems":1352254539935,
> >> "queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}