

Re: Slow region server recoveries
The important thing to note is that the block for this rogue WAL is in the
UNDER_RECOVERY state. I have repeatedly asked the HDFS devs whether stale-node
detection kicks in correctly for UNDER_RECOVERY blocks, but have failed to get
a clear answer.
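
For reference, by stale-node detection I mean the settings from the
HDFS-3703 / HDFS-3912 work. A quick, illustrative sketch (property names
only - I'm not asserting any particular defaults) that just dumps what a
cluster's hdfs-site.xml sets for them:

import org.apache.hadoop.conf.Configuration;

// Dumps the stale-datanode detection knobs discussed in this thread.
// Values printed are whatever hdfs-site.xml on the classpath sets;
// null means the release default is in effect.
public class StaleNodeSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("hdfs-site.xml");  // pull in the HDFS overrides
    String[] keys = {
        "dfs.namenode.stale.datanode.interval",     // ms without a heartbeat before a DN is stale
        "dfs.namenode.avoid.read.stale.datanode",   // sort stale DNs last for readers
        "dfs.namenode.avoid.write.stale.datanode"   // avoid stale DNs for new writes
    };
    for (String key : keys) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}
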
On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> Hi Nicholas,
>
> Regarding the following, I think this is not a recovery - the file below
> is an HFile and is being accessed on a get request. On this cluster, I
> don't have block locality. I see these exceptions for a while and then they
> are gone, which means the stale node thing kicks in.
>
> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.156.194.94:50010 for file
> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
> for block
>
> This is the real bummer. The stale datanode is still listed first even 90
> seconds afterwards.
>
> *2013-04-19 00:28:35*,777 WARN
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
> hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141 failed, returning error
> java.io.IOException: Cannot obtain block length for
> LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
> getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
> 10.156.192.106:50010, 10.156.195.38:50010]}*
>     at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
>     at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
>     at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
>     at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
>     at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
>     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
>     at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
>     at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
>     at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
>     at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
>     at java.lang.Thread.run(Thread.java:662)
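
(To make the ordering concrete: a quick, illustrative sketch - not something
from this thread - that asks the NameNode for a file's block locations and
prints the replicas in the order a client would try them. With stale-node
sorting working, the dead datanode should fall to the end of each list
instead of sitting first as it does above.)

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints, per block, the datanode list the NameNode returns for a file,
// in the order a DFS client would try them.
public class PrintBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path(args[0]);              // e.g. a log under /hbase/.logs/...
    FileStatus st = fs.getFileStatus(p);
    for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
      // A stale datanode should be sorted to the end of getNames() once the
      // NameNode marks it stale and avoid.read.stale.datanode is enabled.
      System.out.println("offset " + b.getOffset() + " -> "
          + Arrays.toString(b.getNames()));
    }
  }
}
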
>
>
>
> On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I looked at it again with a fresh eye. As Varun was saying, the root cause
>> is the wrong order of the block locations.
>>
>> The root cause of the root cause is actually simple: HBase started the
>> recovery while the node was not yet stale from an HDFS point of view.
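
(Spelling out that race with made-up numbers, purely illustrative - the real
values are whatever the cluster sets for zookeeper.session.timeout and
dfs.namenode.stale.datanode.interval:)

// Back-of-the-envelope sketch of the race; all numbers are hypothetical.
public class RecoveryVsStaleness {
  public static void main(String[] args) {
    long staleIntervalMs    = 30000; // hypothetical dfs.namenode.stale.datanode.interval
    long zkSessionTimeoutMs = 20000; // hypothetical zookeeper.session.timeout for the RS
    long splitDispatchMs    = 2000;  // hypothetical delay before the master hands out the split

    long nodeMarkedStaleAt  = staleIntervalMs;                    // ms after the box died
    long logSplittingStarts = zkSessionTimeoutMs + splitDispatchMs;

    if (logSplittingStarts < nodeMarkedStaleAt) {
      // The NameNode still lists the dead datanode first, so the split worker
      // burns its time on connect/read timeouts against a machine that is gone.
      System.out.println("recovery starts " + (nodeMarkedStaleAt - logSplittingStarts)
          + " ms before the node is considered stale");
    }
  }
}
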
>>
>> Varun mentioned this timing:
>