
HBase >> mail # user >> Slow region server recoveries

Re: Slow region server recoveries

I looked at it again with a fresh eye. As Varun was saying, the root cause
is the wrong order of the block locations.

The root cause of the root cause is actually simple: HBase started the
recovery while the node was not yet stale from an HDFS point of view.

Varun mentioned this timing:
Lost Beat: 27:30
Became stale: 27:50 - this is a guess, reverse-engineered from the
20-second stale timeout
Became dead: 37:51
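For reference, the stale check is just a heartbeat-age comparison; here is a minimal sketch of the HDFS-3703 logic, assuming the 20-second stale interval from the timeline above (names and constants are illustrative, not the real HDFS API):

```java
public class StaleCheck {
    // dfs.namenode.stale.datanode.interval, assumed 20s as in this cluster.
    static final long STALE_INTERVAL_MS = 20_000;

    // A datanode is considered stale once its last heartbeat is older
    // than the configured interval.
    static boolean isStale(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > STALE_INTERVAL_MS;
    }

    public static void main(String[] args) {
        long lostBeat = 0; // 27:30 in the timeline above
        // Recovery started at 27:13, i.e. before the beat was even missed,
        // so the NN could not have marked the node stale yet.
        System.out.println(isStale(lostBeat, lostBeat + 15_000)); // false
        System.out.println(isStale(lostBeat, lostBeat + 21_000)); // true (~27:51)
    }
}
```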

But the recovery started at 27:13, 15 seconds before we got this log:
2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect to / for file
for block
15000 millis timeout while waiting for channel to be ready for connect. ch
: java.nio.channels.SocketChannel[connection-pending remote=/]

So when we fetched the block locations from the NN, the datanode was not
yet stale, and we got the wrong (random) order.
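To make the ordering concrete: with stale-node reads enabled, the NN sorts stale replicas to the end of the location list, so a node that is dead but not yet marked stale keeps its normal position. A toy sketch of that ordering, with made-up host names and API:

```java
import java.util.*;

public class BlockOrder {
    // Stable sort pushing stale replicas to the end, roughly what the NN
    // does when stale-node reads are enabled (HDFS-3703). Host names and
    // this API are made up for illustration.
    static List<String> order(List<String> hosts, Set<String> stale) {
        List<String> sorted = new ArrayList<>(hosts);
        sorted.sort(Comparator.comparing(stale::contains)); // false < true
        return sorted;
    }

    public static void main(String[] args) {
        List<String> locs = List.of("dn-dead", "dn-ok");
        // Node lost but not yet stale: it keeps the first position.
        System.out.println(order(locs, Set.of()));          // [dn-dead, dn-ok]
        // Once marked stale, it drops to the end.
        System.out.println(order(locs, Set.of("dn-dead"))); // [dn-ok, dn-dead]
    }
}
```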

ZooKeeper can expire a session before the timeout. I don't know why it does
this in this case, but I don't consider it a ZK bug: if ZK knows that a
node is dead, it's its role to expire the session. There is something more
fishy: we started the recovery while the datanode was still responding to
heartbeats. I don't know why. Maybe the OS was able to kill -15 (SIGTERM)
the RS before vanishing.

Anyway, we then get an exception when we try to connect, because the RS
does not have a TCP connection to this datanode. And this is retried many
times.
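The arithmetic is what hurts: each attempt burns the full 15-second connect timeout before failing. A back-of-the-envelope sketch (the retry count here is an assumption for illustration, not a number from the logs):

```java
public class RetryCost {
    // Each connect attempt to the silently-dead DN burns the full connect
    // timeout before failing, so the worst case is simply their product.
    static long worstCaseMs(int retries, long connectTimeoutMs) {
        return retries * connectTimeoutMs;
    }

    public static void main(String[] args) {
        // 40 retries at the 15s timeout seen in the log is already 10 minutes.
        System.out.println(worstCaseMs(40, 15_000) / 60_000 + " min"); // 10 min
    }
}
```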

You would not have this with trunk, because HBASE-6435 reorders the blocks
inside the client, using information not available to the NN, and excludes
the datanode of the region server under recovery.
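A rough sketch of the HBASE-6435 idea (illustrative only, not the actual patch): the HBase client knows which region server is being recovered, so it can move the datanode colocated with that RS to the end of the location list before reading the WAL:

```java
import java.util.*;

public class Hbase6435Sketch {
    // Stable partition: demote the datanode colocated with the dead region
    // server to the end, keeping the relative order of the others.
    static List<String> reorder(List<String> locations, String deadRsHost) {
        List<String> preferred = new ArrayList<>();
        List<String> demoted = new ArrayList<>();
        for (String host : locations) {
            (host.equals(deadRsHost) ? demoted : preferred).add(host);
        }
        preferred.addAll(demoted);
        return preferred;
    }

    public static void main(String[] args) {
        // "dn-dead" sits on the same box as the RS under recovery.
        System.out.println(reorder(List.of("dn-dead", "dn-a", "dn-b"), "dn-dead"));
        // [dn-a, dn-b, dn-dead]
    }
}
```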

Some conclusions:
 - we should likely backport HBASE-6435 to 0.94.
 - I will revive HDFS-3706 and HDFS-3705 (the non-hacky way to get
 - There are some things that could be better in HDFS. I will see.
 - I'm worried by the SocketTimeoutException. We should get NoRouteToHost
at some point, and we don't. That's also why it takes ages. I think it's an
AWS thing, but it brings two issues: it's slow, and, in HBase, you don't
know if the operation could have been executed or not, so it adds
complexity to some scenarios. If someone with enough network and AWS
knowledge could clarify this point, it would be great.
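For what it's worth, the difference between the failure modes is visible from plain socket code: a host whose packets are silently dropped (typical when an EC2 instance disappears) only fails at the connect timeout, while an unreachable route fails fast with NoRouteToHostException. A small probe sketch (host names and ports are illustrative):

```java
import java.io.IOException;
import java.net.*;

public class ConnectProbe {
    // Classify a connect failure: silent packet drop -> timeout (slow);
    // no route -> fast, definitive; live host with closed port -> RST,
    // which surfaces as ConnectException (also fast).
    static String probe(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return "connected";
        } catch (SocketTimeoutException e) {
            return "timeout";   // slow, and outcome of the remote op unknown
        } catch (NoRouteToHostException e) {
            return "no-route";  // fast, definitive
        } catch (IOException e) {
            return "refused-or-other";
        }
    }

    public static void main(String[] args) throws IOException {
        // Connect to a listener we control so the example is deterministic.
        try (ServerSocket srv = new ServerSocket(0)) {
            System.out.println(probe("127.0.0.1", srv.getLocalPort(), 1_000)); // connected
        }
    }
}
```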



On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> This is 0.94.3 hbase...
> On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> > Hi Ted,
> >
> > I had a long offline discussion with Nicholas on this. Looks like the
> > last block, which was still being written to, took an enormous time to
> > recover. Here's what happened:
> > a) Master creates log split tasks and region servers process them
> > b) Region server tries to recover the lease for each WAL log - most
> > cases are noops since they are already rolled over/finalized
> > c) The last file's lease recovery takes some time since the crashing
> > server was writing to it and had a lease on it - but basically we have
> > the lease 1 minute after the server was lost
> > d) Now we start the recovery for this, but we end up hitting the stale
> > datanode, which is puzzling.
> >
> > It seems that we did not hit the stale datanode when we were trying to
> > recover the finalized WAL blocks with trivial lease recovery. However,
> > for the final block, we hit the stale datanode. Any clue why this might
> > be happening?
> >
> > Varun
> >
> >
> > On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> >> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
> >>
> >> Here are the criteria for stale-node checking to kick in (from
> >> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> >> ):
> >>
> >> +   * Check if the datanode is in stale state. Here if