HBase, mail # user - Slow region server recoveries


Re: Slow region server recoveries
Varun Sharma 2013-04-19, 17:53
Here is the snippet from the DN log:
2013-04-19 00:27:38,337 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Recover RBW replica
BP-696828882-10.168.7.226-1364886167971:blk_40107897639761277_174072
2013-04-19 00:27:38,337 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Recovering ReplicaBeingWritten, blk_40107897639761277_174072, RBW
2013-04-19 00:28:11,471 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at
ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls
recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
targets=[*10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010*],
newGenerationStamp=174413)
2013-04-19 00:41:20,716 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_-5723958680970112840_174056, recoveryId=174413,
replica=ReplicaBeingWritten, blk_-5723958680970112840_174056, RBW
2013-04-19 00:41:20,716 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: changing replica state for
blk_-5723958680970112840_174056 from RBW to RUR
2013-04-19 00:41:20,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
updateReplica:
BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
recoveryId=174413, length=119148648, replica=ReplicaUnderRecovery,
blk_-5723958680970112840_174056, RUR
2013-04-19 00:41:20,745 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: recoverBlocks FAILED:
RecoveringBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
getBlockSize()=0; corrupt=false; offset=-1; locs=[10.156.194.94:50010,
10.156.192.106:50010, 10.156.195.38:50010]}
2013-04-19 00:41:23,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_-5723958680970112840_174056, recoveryId=174418,
replica=FinalizedReplica, blk_-5723958680970112840_174413, FINALIZED
2013-04-19 00:41:23,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: changing replica state for
blk_-5723958680970112840_174056 from FINALIZED to RUR
2013-04-19 00:41:23,736 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
updateReplica:
BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174413,
recoveryId=174418, length=119148648, replica=ReplicaUnderRecovery,
blk_-5723958680970112840_174413, RUR

Block recovery takes a long time and eventually seems to fail. During the
recoverBlock() call, all three datanodes are listed as targets, including
the failed/stale one.
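
To put a number on "a long time", here is a minimal sketch that computes the gap between two of the log lines above. The timestamps are copied from the snippet; the helper names (parse_log_time, elapsed_seconds) are made up for illustration and are not part of any Hadoop tooling.

```python
from datetime import datetime

# Timestamp layout used in the DataNode log lines above, e.g. "2013-04-19 00:28:11,471".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp):
    """Parse a log4j-style timestamp (comma before the milliseconds)."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def elapsed_seconds(start, end):
    """Seconds between two log timestamps (hypothetical helper)."""
    return (parse_log_time(end) - parse_log_time(start)).total_seconds()


# Copied from the snippet above.
recover_block_called = "2013-04-19 00:28:11,471"   # NameNode calls recoverBlock(...)
init_replica_recovery = "2013-04-19 00:41:20,716"  # first initReplicaRecovery for the block
recover_blocks_failed = "2013-04-19 00:41:20,745"  # "recoverBlocks FAILED"

stall = elapsed_seconds(recover_block_called, init_replica_recovery)
to_failure = elapsed_seconds(recover_block_called, recover_blocks_failed)

print(f"recoverBlock -> initReplicaRecovery: {stall:.3f} s")
print(f"recoverBlock -> recoverBlocks FAILED: {to_failure:.3f} s")
```

So roughly 13 minutes pass between the recoverBlock() call and the first initReplicaRecovery line on this datanode, and the failure is logged almost immediately after that.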

On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Can you show the snippet from the DN log that mentions UNDER_RECOVERY?
>
> Here are the criteria for stale node checking to kick in (from
>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> ):
>
> +   * Check if the datanode is in stale state. Here if
> +   * the namenode has not received heartbeat msg from a
> +   * datanode for more than staleInterval (default value is
> +   * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> +   * the datanode will be treated as stale node.
>
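
The check described in that comment can be sketched as below. This is only an illustration of the rule as quoted, not the code from the patch; the function name is hypothetical, and the 30-second value is an assumed default that should be verified against DFSConfigKeys in the Hadoop version in use.

```python
# ASSUMPTION: 30 seconds is used here as the stale interval; check the actual
# default in DFSConfigKeys for your Hadoop version.
ASSUMED_STALE_INTERVAL_MS = 30_000


def is_stale(last_heartbeat_ms, now_ms, stale_interval_ms=ASSUMED_STALE_INTERVAL_MS):
    """Return True if no heartbeat has arrived for more than stale_interval_ms.

    Hypothetical helper mirroring the quoted comment: a datanode is treated
    as stale once the time since its last heartbeat exceeds the interval.
    """
    return now_ms - last_heartbeat_ms > stale_interval_ms


# A node last heard from 45 s ago is stale; one heard from 10 s ago is not.
print(is_stale(last_heartbeat_ms=0, now_ms=45_000))
print(is_stale(last_heartbeat_ms=0, now_ms=10_000))
```

Note that the rule depends only on heartbeat age, so by itself it says nothing about which replica states (such as UNDER_RECOVERY) it applies to.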
>
> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <[EMAIL PROTECTED]>
> wrote:
>
> > Is there a place to upload these logs?
> >
> >
> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi Nicholas,
> > >
> > > Attached are the namenode logs, the dn logs (from one of the healthy
> > > replicas of the WAL block), and the rs logs from the region server
> > > that got stuck doing the log split. Action begins at 2013-04-19 00:27*.
> > >
> > > Also, the rogue block is 5723958680970112840_174056. It's very
> > > interesting to trace this guy through the HDFS logs (dn and nn).
> > >
> > > Btw, do you know what the UNDER_RECOVERY stage is for in HDFS? Also,
> > > does the stale node stuff kick in for that state?
> > >
> > > Thanks