HBase, mail # user - Slow region server recoveries


Re: Slow region server recoveries
Varun Sharma 2013-04-19, 20:09
Hi Ted,

I had a long offline discussion with Nicolas on this. It looks like the last
block, which was still being written to, took an enormous amount of time to
recover. Here's what happened:
a) The master creates the split tasks and the region servers process them
b) Each region server tries to recover the lease for each WAL log - most cases
are no-ops since the logs are already rolled over/finalized
c) Lease recovery for the last file takes some time since the crashed server
was writing to it and held a lease on it - but basically we have the lease 1
minute after the server was lost (see the sketch after this list)
d) Now we start the recovery for this block, but we end up hitting the stale
data node, which is puzzling.
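
As a rough illustration of (b) and (c): the splitter simply keeps asking the
NameNode to recover the lease until the file is reported closed. This is a
minimal sketch assuming a DistributedFileSystem handle; recoverWalLease is a
made-up helper, not the actual HBase FSHDFSUtils code.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    // Minimal sketch only: shows the poll-until-closed pattern for WAL lease
    // recovery, not the real HBase splitter code.
    public class WalLeaseRecoverySketch {
      static boolean recoverWalLease(DistributedFileSystem dfs, Path wal, long timeoutMs)
          throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
          // recoverLease() returns true once the file is closed; for WALs that
          // are already rolled over/finalized it returns true on the first call.
          if (dfs.recoverLease(wal)) {
            return true;
          }
          Thread.sleep(1000); // recovery is asynchronous on the NameNode side, so poll
        }
        return false;
      }
    }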

It seems that we did not hit the stale datanode when we were trying to
recover the finalized WAL blocks with trivial lease recovery. However, for
the final block, we hit the stale datanode. Any clue why this might be
happening?
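
For reference, the staleness criterion Ted quotes below amounts to a
heartbeat-age check against a configured interval. A minimal sketch with
illustrative names, not the actual HDFS DatanodeInfo code:

    // Illustrative only: mirrors the criterion described in the quoted
    // javadoc, not the real HDFS stale-node implementation.
    public class StaleNodeCheckSketch {
      // Corresponds to dfs.namenode.stale.datanode.interval (milliseconds).
      private final long staleIntervalMs;

      public StaleNodeCheckSketch(long staleIntervalMs) {
        this.staleIntervalMs = staleIntervalMs;
      }

      // A datanode is treated as stale once the namenode has gone longer than
      // staleIntervalMs without receiving a heartbeat from it.
      public boolean isStale(long lastHeartbeatMs, long nowMs) {
        return (nowMs - lastHeartbeatMs) >= staleIntervalMs;
      }
    }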

Varun
On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
>
> Here is the criterion for stale node checking to kick in (from
>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> ):
>
> +   * Check if the datanode is in stale state. Here if
> +   * the namenode has not received heartbeat msg from a
> +   * datanode for more than staleInterval (default value is
> +   * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> +   * the datanode will be treated as stale node.
>
>
> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <[EMAIL PROTECTED]>
> wrote:
>
> > Is there a place to upload these logs ?
> >
> >
> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi Nicholas,
> > >
> > > Attached are the namenode, dn logs (of one of the healthy replicas of the
> > > WAL block) and the rs logs which got stuck doing the log split. Action
> > > begins at 2013-04-19 00:27*.
> > >
> > > Also, the rogue block is 5723958680970112840_174056. It's very interesting
> > > to trace this guy through the HDFS logs (dn and nn).
> > >
> > > Btw, do you know what the UNDER_RECOVERY stage is for in HDFS? Also, does
> > > the stale node stuff kick in for that state?
> > >
> > > Thanks
> > > Varun
> > >
> > >
> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
> > >
> > >> Thanks for the detailed scenario and analysis. I'm going to have a look.
> > >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
> > >> times out), could you please send them directly to me?
> > >>
> > >> Thanks,
> > >>
> > >> Nicolas
> > >>
> > >>
> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <[EMAIL PROTECTED]>
> > >> wrote:
> > >>
> > >> > Hi Nicholas,
> > >> >
> > >> > Here is the failure scenario, I have dug up the logs.
> > >> >
> > >> > A machine fails and stops accepting/transmitting traffic. The HMaster
> > >> > starts the distributed split for 13 tasks. There are 12 region servers.
> > >> > 12 tasks succeed but the 13th one takes a looong time.
> > >> >
> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20 seconds.
> > >> > Both patches are there.
> > >> >
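(As a point of reference, those settings correspond to configuration along
these lines; the key names are the standard HBase/HDFS ones and the values
simply mirror the numbers above, so this is an assumed sketch rather than the
actual cluster config:)

    import org.apache.hadoop.conf.Configuration;

    // Assumed configuration sketch matching the values described above.
    public class TimeoutConfigSketch {
      static Configuration configure() {
        Configuration conf = new Configuration();
        // HBase ZooKeeper session timeout: 30 seconds.
        conf.setInt("zookeeper.session.timeout", 30000);
        // HDFS stale-datanode interval: 20 seconds.
        conf.setLong("dfs.namenode.stale.datanode.interval", 20000);
        // Flags that make the NameNode avoid stale nodes for reads and writes
        // (assumed to be what "both patches" refers to).
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        return conf;
      }
    }
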
> > >> > a) Machine fails around 27:30
> > >> > b) Master starts the split around 27:40 and submits the tasks. The one
> > >> > task which fails seems to be the one which contains the WAL being
> > >> > currently written to:
> > >> >
> > >> > 2013-04-19 00:27:44,325 INFO
> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
> > >> > hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141,
> > >> > length=0
> > >> >
> > >> > Basically this region server picks up the task but finds the length of
> > >> > this file to be 0 and drops it. This happens again