Re: Slow region server recoveries
The important thing to note is that the block for this rogue WAL is in
UNDER_RECOVERY state. I have repeatedly asked the HDFS devs whether the stale-node
handling kicks in correctly for UNDER_RECOVERY blocks, but have not gotten an answer.
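For reference, the stale-node handling in question is opt-in on the NameNode side.
A minimal sketch of the two knobs involved, using the standard Hadoop Configuration
API; the 30-second interval below is only illustrative, not a recommendation:

    // Sketch: stale-node avoidance for reads (HDFS-side settings).
    // Property names are the stock HDFS ones; the values are illustrative.
    import org.apache.hadoop.conf.Configuration;

    public class StaleNodeReadSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How long a DataNode may miss heartbeats before the NameNode treats it as stale.
        conf.setLong("dfs.namenode.stale.datanode.interval", 30_000L);
        // When enabled, stale DataNodes are pushed to the end of the block-location
        // list returned to clients, so readers try live replicas first.
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        System.out.println("avoid stale reads: "
            + conf.getBoolean("dfs.namenode.avoid.read.stale.datanode", false));
      }
    }
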
On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> Hi Nicholas,
>
> Regarding the following, I think this is not a recovery - the file below
> is an HFile and is being accessed on a get request. On this cluster, I
> don't have block locality. I see these exceptions for a while and then they
> are gone, which means the stale node thing kicks in.
>
> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.156.194.94:50010 for file
> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
> for block
>
> This is the real bummer. The stale datanode is still listed first even 90
> seconds afterwards.
>
> *2013-04-19 00:28:35*,777 WARN
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
> hdfs://
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141 failed, returning error
> java.io.IOException: Cannot obtain block length for
> LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
> getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
> 10.156.192.106:50010, 10.156.195.38:50010]}*
>         at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
>         at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
>         at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
>         at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
>         at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
>         at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
>         at java.lang.Thread.run(Thread.java:662)
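To illustrate why the ordering above hurts: the reader works through the returned
locations in order, so when the dead node is listed first, the open pays a full
connect timeout before the next replica is even tried. A rough standalone sketch of
that behaviour (not the actual DFSClient code; the addresses come from the log above
and the timeout value is made up):

    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.util.Arrays;
    import java.util.List;

    // Illustration only: try block locations in the order the NameNode returned them.
    // A dead first replica costs the full connect timeout before the second is tried.
    public class ReplicaOrderSketch {
      static final int CONNECT_TIMEOUT_MS = 60_000;  // illustrative, not an HDFS default

      static boolean tryReplicas(List<InetSocketAddress> locations) {
        for (InetSocketAddress dn : locations) {
          long start = System.currentTimeMillis();
          try (Socket s = new Socket()) {
            s.connect(dn, CONNECT_TIMEOUT_MS);
            return true;  // first reachable replica wins
          } catch (Exception e) {
            System.out.println("gave up on " + dn + " after "
                + (System.currentTimeMillis() - start) + " ms");
          }
        }
        return false;
      }

      public static void main(String[] args) {
        // The three replicas from the LocatedBlock above; the dead node comes first.
        tryReplicas(Arrays.asList(
            new InetSocketAddress("10.156.194.94", 50010),
            new InetSocketAddress("10.156.192.106", 50010),
            new InetSocketAddress("10.156.195.38", 50010)));
      }
    }
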
>
>
>
> On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I looked at it again with a fresh eye. As Varun was saying, the root cause
>> is the wrong order of the block locations.
>>
>> The root cause of the root cause is actually simple: HBase started the
>> recovery while the node was not yet stale from an HDFS point of view.
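A rough way to see the race: HBase kicks off log splitting once the dead
RegionServer's ZooKeeper session expires, while the NameNode only demotes that
node's replicas after the stale interval elapses. A toy sketch with made-up numbers
(neither value comes from this thread):

    // Toy illustration of the window where log splitting starts before the
    // NameNode considers the dead DataNode stale. Values are hypothetical.
    public class RecoveryTimingSketch {
      public static void main(String[] args) {
        long zkSessionTimeoutMs = 30_000L;  // hypothetical zookeeper.session.timeout
        long staleIntervalMs    = 45_000L;  // hypothetical dfs.namenode.stale.datanode.interval

        // During this window the splitting worker can be handed a location list
        // that still puts the dead DataNode first.
        long raceWindowMs = Math.max(0, staleIntervalMs - zkSessionTimeoutMs);
        System.out.println("race window: " + raceWindowMs + " ms");
      }
    }
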
>>
>> Varun mentioned this timing:
>