Re: Slow region server recoveries
Is there a place to upload these logs?
On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> Hi Nicolas,
>
> Attached are the namenode, dn logs (of one of the healthy replicas of the
> WAL block) and the rs logs for the server which got stuck doing the log
> split. Action begins at 2013-04-19 00:27*.
>
> Also, the rogue block is 5723958680970112840_174056. It's very interesting
> to trace this block through the HDFS logs (dn and nn).
>
> Btw, do you know what the UNDER_RECOVERY stage is for in HDFS? Also, does
> the stale node handling kick in for that state?
>
> Thanks
> Varun
>
>
> On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>
>> Thanks for the detailed scenario and analysis. I'm going to have a look.
>> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
>> times out); could you please send them directly to me?
>>
>> Thanks,
>>
>> Nicolas
>>
>>
>> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <[EMAIL PROTECTED]>
>> wrote:
>>
>> > Hi Nicolas,
>> >
>> > Here is the failure scenario; I have dug up the logs.
>> >
>> > A machine fails and stops accepting/transmitting traffic. The HMaster
>> > starts the distributed split for 13 tasks. There are 12 region servers.
>> > 12 tasks succeed but the 13th one takes a looong time.
>> >
>> > ZooKeeper timeout is set to 30 seconds. Stale node timeout is 20 seconds.
>> > Both patches are there.
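For reference, the two timeouts described above map to standard HBase/HDFS settings. The sketch below shows how they would typically be configured; the values are taken from the numbers in this message, and it assumes "both patches" refers to the HDFS stale-datanode read/write changes:

    <!-- hbase-site.xml: ZooKeeper session timeout (30 s as stated above) -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>30000</value>
    </property>

    <!-- hdfs-site.xml: stale-datanode detection (20 s as stated above) -->
    <property>
      <name>dfs.namenode.stale.datanode.interval</name>
      <value>20000</value>
    </property>
    <property>
      <name>dfs.namenode.avoid.read.stale.datanode</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.namenode.avoid.write.stale.datanode</name>
      <value>true</value>
    </property>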
>> >
>> > a) Machine fails around 27:30
>> > b) Master starts the split around 27:40 and submits the tasks. The one
>> > task which fails seems to be the one which contains the WAL being
>> > currently written to:
>> >
>> > 2013-04-19 00:27:44,325 INFO
>> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
>> > hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141,
>> > length=0
>> >
>> > Basically this region server picks up the task but finds the length of
>> > this file to be 0 and drops it. This happens again.
>> >
>> > c) Finally another region server picks up the task but it ends up going
>> > to the bad datanode (which should not happen because of the stale node
>> > timeout). Unfortunately it hits the 45 retries and a connect timeout of
>> > 20 seconds every time. This delays recovery significantly. Now I guess
>> > reducing the # of retries to 1 is one possibility. But then the namenode
>> > logs are very interesting.
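The 45 retries with a 20-second connect timeout quoted above match the Hadoop IPC client defaults, so the "reduce the number of retries" idea would presumably translate to something like the following core-site.xml sketch (assumption: the figures correspond to ipc.client.connect.max.retries.on.timeouts, default 45, and ipc.client.connect.timeout, default 20000 ms):

    <!-- core-site.xml on the DFS client side (sketch, assumed keys) -->
    <property>
      <name>ipc.client.connect.timeout</name>
      <value>20000</value>  <!-- the 20 s connect timeout seen above -->
    </property>
    <property>
      <name>ipc.client.connect.max.retries.on.timeouts</name>
      <value>1</value>  <!-- down from the default of 45 -->
    </property>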
>> >
>> > d) The namenode seems to be stuck in a cyclic lease recovery loop until
>> > the node is marked dead. There is this one last block which exhibits this.
>> >
>> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK*
>> > blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>> > primaryNodeIndex=1,
>> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
>> > primary=10.156.192.106:50010
>> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
>> > NameSystem.internalReleaseLease: File
>> > /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>> > has not been closed. Lease recovery is in progress. RecoveryId = 174413
>> > for block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>> > primaryNodeIndex=1,
>> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>> >
>> > I see this over and over again in the logs until the datanode is marked
>> > dead. It seems to be cycling through the replicas for this WAL block and
>> > trying to add it to the recovery list. I looked at the code and it says:
>> >
>> >       // Cannot close file right now, since the last block requires
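The loop in the namenode log above is the NameNode repeatedly re-initiating recovery of the last WAL block while the file stays open. As a client-side illustration of what lease recovery means here, the sketch below forces and polls recovery on a WAL path using the public DistributedFileSystem#recoverLease call; the class name and the standalone wrapper are illustrative only, not the code HBase runs during log splitting:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    // Illustrative helper (not HBase code): ask the NameNode to recover the
    // lease on a WAL file that is still open for write, and poll until the
    // file is closed.
    public class WalLeaseRecoveryCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path wal = new Path(args[0]);  // e.g. the -splitting WAL path above
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(wal.toUri(), conf);
        // recoverLease() returns true once the file has been closed; until
        // then the block stays in UNDER_RECOVERY, as seen in the logs above.
        while (!dfs.recoverLease(wal)) {
          Thread.sleep(1000);
        }
        System.out.println("lease recovered, length=" +
            dfs.getFileStatus(wal).getLen());
      }
    }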