HBase, mail # dev - Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2


Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
ramkrishna vasudevan 2012-11-06, 11:29
Raised HBASE-7103 for the same.

Regards
Ram

On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan <
[EMAIL PROTECTED]> wrote:

> Thanks for the logs.
> I found the reason.
>
> The following steps happen:
> -> Initially the parent region P1 starts splitting.
> -> The split proceeds normally.
> -> Another split starts at the same time for the same region P1. (Not sure
> why this started.)
> -> Rollback happens on seeing an already existing node.
> -> This node gets deleted in the rollback and a nodeDeleted event fires.
> -> In the nodeDeleted event the RIT entry for region P1 gets deleted.
> -> Because of this the region is no longer in RIT.
> -> Now the first split finishes.  Here the problem is that we try to
> transition the node from SPLITTING to SPLIT, but the node does not even exist.
> We don't take any action on this; we treat it as successful.
> -> Even before HBASE-6854 this could have happened.  Will file a JIRA for
> the same.
>
> Regards
> Ram
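
The sequence of steps Ram lists above can be reproduced in a toy model. The following is only a sketch under stated assumptions: the class `SplitRaceSketch`, the in-memory `znodes` map, and the helpers `createSplitting` and `transition` are hypothetical stand-ins invented for illustration, not the HBase or ZooKeeper API. It also assumes that the rollback belongs to the second split attempt. The point it illustrates is that the final SPLITTING to SPLIT transition reports failure once the node is gone, and the caller has to check that result instead of assuming success.

```java
import java.util.HashMap;
import java.util.Map;

public class SplitRaceSketch {
    // Toy stand-in for the per-region znodes (hypothetical; NOT the ZooKeeper API).
    static final Map<String, String> znodes = new HashMap<>();

    /** Creates the node in SPLITTING state; returns false if a node already exists. */
    static boolean createSplitting(String region) {
        return znodes.putIfAbsent(region, "SPLITTING") == null;
    }

    /** Moves the node from one state to another; false if it is missing or in a different state. */
    static boolean transition(String region, String from, String to) {
        return znodes.replace(region, from, to);
    }

    /** Replays the interleaving from the thread; returns the result of the final transition. */
    static boolean runScenario() {
        znodes.clear();
        String p1 = "P1";

        createSplitting(p1);                     // first split creates the SPLITTING node
        boolean created = createSplitting(p1);   // second split finds the node already there
        if (!created) {
            znodes.remove(p1);                   // its rollback deletes the first split's node
        }

        // The first split now finishes and tries SPLITTING -> SPLIT on a missing node.
        return transition(p1, "SPLITTING", "SPLIT");
    }

    public static void main(String[] args) {
        if (!runScenario()) {
            // This is the result that must not be ignored.
            System.out.println("transition failed: node missing, split must not be reported as done");
        }
    }
}
```

In this model the final `transition` call returns `false`, because the rollback removed the node that the first split still depended on. Ignoring that return value corresponds to the "we treat it as successful" step in the list above.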
>
> On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
>
>> Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.
>> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
>> regionserver in question (node 169), so i'm guessing that's
>> 22f8fa73d8af837410ca270f344f6bb8's mommy.
>>
>> btw - you can see my balancer kick in 45 seconds later (runs every 10
>> minutes) here, but so far i think that's coincidence:
>> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
>> BalanceSwitch=false
>>
>> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169
>> and found all this stuff about a failed split:
>> http://pastebin.com/xtXMZ388 and
>> an attempted rollback.  Looks like it errors out when it goes to put a node
>> in ZK but it's already there.  I'm not familiar with what a good split log
>> looks like, so i'll stop commenting for now...
>>
>>
>> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
>> [EMAIL PROTECTED]> wrote:
>>
>> > The log shows that even the first time the region was transitioned to
>> > SPLITTING, it was not populated in the Master's memory.
>> >
>> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
>> > [EMAIL PROTECTED]> wrote:
>> >
>> > > Could you attach the master logs at this time
>> > > 2012-11-05 00:24:55?
>> > >
>> > > Regards
>> > > Ram
>> > >
>> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED]
>> > >wrote:
>> > >
>> > >> Took a brief look through all SPLIT related commits since 0.94.0...
>> > Found
>> > >> these:
>> > >>
>> > >> HBASE-6854 *
>> > >> HBASE-6713
>> > >> HBASE-6329 *
>> > >>
>> > >> HBASE-6088
>> > >>
>> > >> HBASE-5986
>> > >> HBASE-6070 *
>> > >>
>> > >>
>> > >> The ones marked with * are (IMHO) more likely to be related.
>> > >>
>> > >> -- Lars
>> > >>
>> > >> ________________________________
>> > >> From: Matt Corgan <[EMAIL PROTECTED]>
>> > >> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]>
>> > >> Sent: Monday, November 5, 2012 9:28 PM
>> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> > >>
>> > >> Yeah - we were running .94.0 since it came out but never saw it there.
>> > >> I'll keep trying to narrow it down.  The only harm it's causing is log
>> > >> spam and failing to move daughters to a new regionserver, which are
>> > >> definitely problems, but it's not bringing down the cluster.
>> > >>
>> > >>
>> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]>
>> > >> wrote:
>> > >>
>> > >> > So it seems you can repeat this to some extent in 0.94.2, but you have
>> > >> > never seen this before?
>> > >> >
>> > >> >
>> > >> > -- Lars
>> > >> >
>> > >> >
>> > >> >
>> > >> > ________________________________
>> > >> >  From: Matt Corgan <[EMAIL PROTECTED]>
>> > >> > To: dev <[EMAIL PROTECTED]>
>> > >> > Sent: Monday, November 5, 2012 9:10 PM
>> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>>