HBase >> mail # user >> Slow region server recoveries


Re: Slow region server recoveries
Hi Nicholas,

Regarding the following, I think this is not a recovery - the file below is
an HFile being accessed on a get request. On this cluster I don't have
block locality. I see these exceptions for a while and then they are gone,
which means the stale-node detection eventually kicks in.

2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect to /10.156.194.94:50010 for file
/hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/
02f639bb43944d4ba9abcf58287831c0
for block

This is the real bummer: the stale datanode is still listed first even 90
seconds afterwards.

*2013-04-19 00:28:35*,777 WARN
org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
failed, returning error
java.io.IOException: Cannot obtain block length for
LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
10.156.192.106:50010, 10.156.195.38:50010]}*
	at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
	at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
	at java.lang.Thread.run(Thread.java:662)
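For context on that ordering: whether the namenode demotes stale datanodes to
the end of the block-locations list is controlled by the HDFS stale-node
settings. A sketch of the relevant hdfs-site.xml properties (values here are
illustrative, not taken from this cluster's actual config):

```xml
<!-- Illustrative hdfs-site.xml fragment; values are examples only. -->
<property>
  <!-- Mark a datanode stale after this many ms without a heartbeat
       (default 30000). -->
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>20000</value>
</property>
<property>
  <!-- Sort stale datanodes to the end of the block-locations list
       returned for reads. -->
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
```

Until the interval elapses and the node is actually marked stale, the namenode
keeps returning it first, which matches what the log above shows.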

On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I looked at it again with a fresh eye. As Varun was saying, the root cause
> is the wrong order of the block locations.
>
> The root cause of the root cause is actually simple: HBase started the
> recovery while the node was not yet stale from an HDFS point of view.
>
> Varun mentioned this timing:
> Lost beat: 27:30
> Became stale: 27:50 (a guess, reverse-engineered from the 20-second
> stale timeout)
> Became dead: 37:51
>
> But the recovery started at 27:13 (15 seconds before we have this log
> line)
> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.156.194.94:50010 for file
>
> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
> for block