HBase >> mail # user >> Slow region server recoveries


Re: Slow region server recoveries
Here is the snippet:
2013-04-19 00:27:38,337 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-696828882-10.168.7.226-1364886167971:blk_40107897639761277_174072
2013-04-19 00:27:38,337 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_40107897639761277_174072, RBW
2013-04-19 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056, targets=[*10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010*], newGenerationStamp=174413)
2013-04-19 00:41:20,716 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_-5723958680970112840_174056, recoveryId=174413, replica=ReplicaBeingWritten, blk_-5723958680970112840_174056, RBW
2013-04-19 00:41:20,716 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: changing replica state for blk_-5723958680970112840_174056 from RBW to RUR
2013-04-19 00:41:20,733 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: updateReplica: BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056, recoveryId=174413, length=119148648, replica=ReplicaUnderRecovery, blk_-5723958680970112840_174056, RUR
2013-04-19 00:41:20,745 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: recoverBlocks FAILED: RecoveringBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056; getBlockSize()=0; corrupt=false; offset=-1; locs=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010]}
2013-04-19 00:41:23,733 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_-5723958680970112840_174056, recoveryId=174418, replica=FinalizedReplica, blk_-5723958680970112840_174413, FINALIZED
2013-04-19 00:41:23,733 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: changing replica state for blk_-5723958680970112840_174056 from FINALIZED to RUR
2013-04-19 00:41:23,736 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: updateReplica: BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174413, recoveryId=174418, length=119148648, replica=ReplicaUnderRecovery, blk_-5723958680970112840_174413, RUR

Block recovery takes a long time and eventually seems to fail. During the
recoverBlock() call, all three datanodes are listed as targets, including the
failed/stale one.
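
For reference, this is a minimal sketch of how one could dump the stale-datanode
settings relevant to this thread (the 2.x-era key names and the hdfs-site.xml
path are assumptions, and the defaults shown are the stock ones, not this
cluster's values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class StaleConfigCheck {
  public static void main(String[] args) {
    // Load the cluster's hdfs-site.xml (path is an assumption; adjust as needed).
    Configuration conf = new Configuration();
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

    // Interval after which a datanode with no recent heartbeat is treated as
    // stale; the stock default is 30 seconds.
    long staleIntervalMs = conf.getLong("dfs.namenode.stale.datanode.interval", 30000L);
    // Whether the namenode avoids stale datanodes for reads and for writes;
    // both are off by default.
    boolean avoidReadStale = conf.getBoolean("dfs.namenode.avoid.read.stale.datanode", false);
    boolean avoidWriteStale = conf.getBoolean("dfs.namenode.avoid.write.stale.datanode", false);

    System.out.println("dfs.namenode.stale.datanode.interval    = " + staleIntervalMs);
    System.out.println("dfs.namenode.avoid.read.stale.datanode  = " + avoidReadStale);
    System.out.println("dfs.namenode.avoid.write.stale.datanode = " + avoidWriteStale);
  }
}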

On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Can you show a snippet from the DN log which mentioned UNDER_RECOVERY?
>
> Here are the criteria for stale node checking to kick in (from
>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> ):
>
> +   * Check if the datanode is in stale state. Here if
> +   * the namenode has not received heartbeat msg from a
> +   * datanode for more than staleInterval (default value is
> +   * {@link
> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> +   * the datanode will be treated as stale node.
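
In other words, a minimal sketch of that check (the method and parameter names
here are assumptions, not the actual HDFS code):

static boolean isStale(long lastHeartbeatTimeMs, long staleIntervalMs) {
  // A datanode is treated as stale once the namenode has gone longer than the
  // configured stale interval (default 30 seconds) without receiving a heartbeat.
  return System.currentTimeMillis() - lastHeartbeatTimeMs > staleIntervalMs;
}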
>
>
> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <[EMAIL PROTECTED]>
> wrote:
>
> > Is there a place to upload these logs?
> >
> >
> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi Nicholas,
> > >
> > > Attached are the namenode, dn logs (of one of the healthy replicas of the
> > > WAL block) and the rs logs which got stuck doing the log split. Action
> > > begins at 2013-04-19 00:27*.
> > >
> > > Also, the rogue block is 5723958680970112840_174056. It's very interesting
> > > to trace this guy through the HDFS logs (dn and nn).
> > >
> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS? Also does
> > > the stale node stuff kick in for that state?
> > >
> > > Thanks