Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - Slow region server recoveries


Copy link to this message
-
Re: Slow region server recoveries
Nicolas Liochon 2013-04-22, 07:51
The recovery starts at 27:40 (master log: 00:27:40,011), so before the
datanode is known as stale.
But the first attempt is cancelled, and a new one start at 28:10, before
being cancelled again at 28:35 (that's HBASE-6738).
These two attempts should see the datanode as stale. It seems it's not the
case. I don't know why it's not reordered. A possibility is the stale ratio
is not configured, or the other nodes were stale as well. Or HDFS doesn't
see the node as stale.

In any case, the story would be different with HBASE-8276 (i.e. HBASE-6738
backport to 94.7) + HBASE-6435.
If you have a reproduction scenario, it's worth trying.

For HBASE-8389, yes, we need this to work...

Nicolas
<https://issues.apache.org/jira/browse/HBASE-8389>
On Sun, Apr 21, 2013 at 7:38 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> Hi Ted, Nicholas,
>
> Thanks for the comments. We found some issues with lease recovery and I
> patched HBASE 8354 to ensure we don't see data loss. Could you please look
> at HDFS 4721 and HBASE 8389 ?
>
> Thanks
> Varun
>
>
> On Sat, Apr 20, 2013 at 10:52 AM, Varun Sharma <[EMAIL PROTECTED]>
> wrote:
>
> > The important thing to note is the block for this rogue WAL is
> > UNDER_RECOVERY state. I have repeatedly asked HDFS dev if the stale node
> > thing kicks in correctly for UNDER_RECOVERY blocks but failed.
> >
> >
> > On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma <[EMAIL PROTECTED]
> >wrote:
> >
> >> Hi Nicholas,
> >>
> >> Regarding the following, I think this is not a recovery - the file below
> >> is an HFIle and is being accessed on a get request. On this cluster, I
> >> don't have block locality. I see these exceptions for a while and then
> they
> >> are gone, which means the stale node thing kicks in.
> >>
> >> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> >> connect to /10.156.194.94:50010 for file
> >> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/
> >> 02f639bb43944d4ba9abcf58287831c0
> >> for block
> >>
> >> This is the real bummer. The stale datanode is 1st even 90 seconds
> >> afterwards.
> >>
> >> *2013-04-19 00:28:35*,777 WARN
> >> org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
> >> hdfs://
> >>
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141failed,
> returning error
> >> java.io.IOException: Cannot obtain block length for
> >>
> LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
> >> getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
> >> 10.156.192.106:50010, 10.156.195.38:50010]}*
> >> >---at
> >>
> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
> >> >---at
> >> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
> >> >---at
> >> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
> >> >---at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
> >> >---at
> >>
> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
> >> >---at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
> >> >---at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)