Re: Slow region server recoveries
Hi Nicolas,

Regarding the following, I think this is not a recovery: the file below is an HFile being read for a get request. On this cluster I don't have block locality. I see these exceptions for a while and then they are gone, which means the stale-node detection kicks in.

2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.156.194.94:50010 for file /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0 for block
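For context, here is a minimal sketch (mine, not from this thread) that prints the two stock HDFS properties behind that stale-node behaviour: whether the NameNode pushes stale datanodes to the end of the block-location list it hands to readers, and how long a datanode must miss heartbeats before it is marked stale. The property names are the upstream ones and the defaults shown are upstream defaults, not necessarily what this cluster runs.

import org.apache.hadoop.conf.Configuration;

public class StaleNodeSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // hdfs-site.xml is not loaded by a bare Configuration, so add it explicitly.
        conf.addResource("hdfs-site.xml");
        // Does the NameNode order stale datanodes last in the locations returned to readers?
        System.out.println("dfs.namenode.avoid.read.stale.datanode = "
                + conf.getBoolean("dfs.namenode.avoid.read.stale.datanode", false));
        // How long without a heartbeat before a datanode is marked stale (upstream default 30s).
        System.out.println("dfs.namenode.stale.datanode.interval = "
                + conf.getLong("dfs.namenode.stale.datanode.interval", 30000L));
    }
}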

This is the real bummer: the stale datanode is still listed first even 90 seconds afterwards.

*2013-04-19 00:28:35*,777 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141 failed, returning error
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056; getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010]}*
    at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
    at java.lang.Thread.run(Thread.java:662)
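The "Cannot obtain block length" here is what you get when reading a WAL whose last block was still being written when the region server died: the block has no finalized length until the NameNode recovers the file's lease and closes it. A minimal sketch of forcing that from a client (the helper name and the polling loop are mine, not HBase's actual recovery path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class WalLeaseRecovery {
    // Hypothetical helper: ask the NameNode to recover the lease on a WAL that
    // was mid-write when its region server died. recoverLease() returns true
    // once the file is closed and its last block has a finalized length, i.e.
    // once a reader would no longer hit "Cannot obtain block length".
    public static void waitForLeaseRecovery(Configuration conf, Path wal)
            throws Exception {
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        while (!dfs.recoverLease(wal)) {
            Thread.sleep(1000);  // poll; production code should bound the wait
        }
    }
}

(HBase's splitter already does its own lease recovery through its FS utilities; the sketch is only to show why the exception clears once recovery completes.)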

On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I looked at it again with a fresh eye. As Varun was saying, the root cause
> is the wrong order of the block locations.
>
> The root cause of the root cause is actually simple: HBase started the
> recovery while the node was not yet stale from an HDFS point of view.
>
> Varun mentioned this timing:
> Lost heartbeat: 27:30
> Became stale: 27:50 (a guess, reverse-engineered from the 20-second stale timeout)
> Became dead: 37:51
>
> But the recovery started at 27:13 (15 seconds before we have this log line):
>
> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.156.194.94:50010 for file /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0 for block