Re: Slow region server recoveries
Can you show a snippet from the DN log which mentions UNDER_RECOVERY?

Here is the criterion for stale node checking to kick in (from
https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
):

+   * Check if the datanode is in stale state. Here if
+   * the namenode has not received heartbeat msg from a
+   * datanode for more than staleInterval (default value is
+   * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
+   * the datanode will be treated as stale node.
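In code terms, the check described above is just a heartbeat-age comparison. A
minimal sketch (my paraphrase, not the actual DatanodeInfo code) of that
criterion, where the interval is the dfs.namenode.stale.datanode.interval
setting referenced by the constant in the javadoc:

    // Sketch only: a datanode counts as stale once the namenode has not seen a
    // heartbeat from it for longer than the configured stale interval.
    public class StaleNodeCheckSketch {
        static boolean isStale(long lastHeartbeatMillis, long staleIntervalMillis) {
            return System.currentTimeMillis() - lastHeartbeatMillis > staleIntervalMillis;
        }
    }

If I remember the patch right, the namenode only acts on the stale flag for
read ordering when dfs.namenode.avoid.read.stale.datanode is also set to true.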
On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> Is there a place to upload these logs?
>
>
> On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <[EMAIL PROTECTED]>
> wrote:
>
> > Hi Nicolas,
> >
> > Attached are the namenode, dn logs (of one of the healthy replicas of the
> > WAL block) and the rs logs which got stuck doing the log split. Action
> > begins at 2013-04-19 00:27*.
> >
> > Also, the rogue block is 5723958680970112840_174056. It's very interesting
> > to trace this guy through the HDFS logs (dn and nn).
> >
> > Btw, do you know what the UNDER_RECOVERY stage is for in HDFS? Also, does
> > the stale node stuff kick in for that state?
> >
> > Thanks
> > Varun
> >
> >
> > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <[EMAIL PROTECTED]>
> > wrote:
> >
> >> Thanks for the detailed scenario and analysis. I'm going to have a look.
> >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
> >> times out), could you please send them directly to me?
> >>
> >> Thanks,
> >>
> >> Nicolas
> >>
> >>
> >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >> > Hi Nicolas,
> >> >
> >> > Here is the failure scenario; I have dug up the logs.
> >> >
> >> > A machine fails and stops accepting/transmitting traffic. The HMaster
> >> > starts the distributed split for 13 tasks. There are 12 region servers.
> >> > 12 tasks succeed but the 13th one takes a looong time.
> >> >
> >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20 seconds.
> >> > Both patches are there.
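For reference, a rough sketch of where those two timeouts live, assuming the
standard keys (zookeeper.session.timeout belongs in hbase-site.xml,
dfs.namenode.stale.datanode.interval in the namenode's hdfs-site.xml); the
values are just the ones from this scenario:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    // Sketch only, not taken from the actual cluster configs in this thread.
    public class RecoveryTimeoutsSketch {
        public static void main(String[] args) {
            // Region server side: ZK session timeout before the RS is declared dead.
            Configuration hbaseConf = HBaseConfiguration.create();
            hbaseConf.setInt("zookeeper.session.timeout", 30000);              // 30 s

            // NameNode side: heartbeat age after which a datanode is marked stale.
            Configuration hdfsConf = new Configuration();
            hdfsConf.setLong("dfs.namenode.stale.datanode.interval", 20000L);  // 20 s
        }
    }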
> >> >
> >> > a) Machine fails around 27:30
> >> > b) Master starts the split around 27:40 and submits the tasks. The one
> >> > task which fails seems to be the one which contains the WAL being
> >> > currently written to:
> >> >
> >> > 2013-04-19 00:27:44,325 INFO
> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
> >> > hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141,
> >> > length=0
> >> >
> >> > Basically this region server picks up the task but finds the length of
> >> > this file to be 0 and drops it. This happens again.
> >> >
> >> > c) Finally another region server picks up the task but it ends up going
> >> > to the bad datanode (which should not happen because of the stale node
> >> > timeout). Unfortunately it hits the 45 retries and a connect timeout of 20
> >> > seconds every time. This delays recovery significantly. Now I guess
> >> > reducing the # of retries to 1 is one possibility. But then the namenode
> >> > logs are very interesting.
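About the 45 retries with a 20 second connect timeout: they look like the
Hadoop IPC client defaults (ipc.client.connect.max.retries.on.timeouts = 45,
ipc.client.connect.timeout = 20000 ms), i.e. roughly 15 minutes before a
connection attempt to the dead datanode is abandoned. A hedged sketch of
capping them; the values are illustrative, and whether these are the right
knobs depends on which code path is doing the retrying:

    import org.apache.hadoop.conf.Configuration;

    // Sketch only: standard Hadoop IPC client settings, not HBase-specific ones.
    public class ConnectRetrySketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Default is 45 attempts; 45 x 20 s is the ~15 minute stall seen here.
            conf.setInt("ipc.client.connect.max.retries.on.timeouts", 1);
            // Per-attempt connect timeout, 20000 ms by default.
            conf.setInt("ipc.client.connect.timeout", 20000);
        }
    }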
> >> >
> >> > d) Namenode seems to be in a cyclic lease recovery loop until the node is
> >> > marked dead. There is this one last block which exhibits this.
> >> >
> >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK*
> >> > blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> >> > primaryNodeIndex=1,
> >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
> >> > primary=10.156.192.106:50010
> >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> >> > NameSystem.internalReleaseLease: File
> >> >