HBase >> mail # dev >> Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2


Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
I dug back further to find the origin of e11e8b030897d6e5b973f8fe892e0eb2
to see if it had previous problems that left ZK in a bad state.  Here are the
regionserver and master logs from about 2 hours earlier:
http://pastebin.com/qcvHjNCg

* Nov 4, 22:34: region is created as daughter b of a split on node 159
* Nov 4, 22:35: moved from node 159 to node 169 by HBaseAdmin.move()
* Nov 5, 00:24: node 169 tries to split the region but gets Failed create
of ephemeral /hbase/unassigned/e11e8b030897d6e5b973f8fe892e0eb2

Is it possible that if something calls HBaseAdmin.move() on a daughter
region that is 30 seconds old, it could move the region but leave that ZK
node in a bad state?
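
To make the sequence Ram describes in the quoted message below easier to follow, here is a minimal sketch of it. This is not HBase code: `FakeZk`, `run_scenario`, and the `check_result` flag are hypothetical names invented for illustration, the "znodes" are just a Python dict, and the real split and rollback code paths are more involved. The only point it shows is that a SPLITTING to SPLIT transition on a node that no longer exists goes unnoticed unless the caller checks the result.

```python
class FakeZk:
    """In-memory stand-in for the znodes under /hbase/unassigned (hypothetical)."""

    def __init__(self):
        self.nodes = {}  # region name -> state string

    def create_ephemeral(self, region, state):
        # Creating a node that already exists fails, as in the log above.
        if region in self.nodes:
            raise FileExistsError(
                f"Failed create of ephemeral /hbase/unassigned/{region}")
        self.nodes[region] = state

    def delete(self, region):
        self.nodes.pop(region, None)

    def transition(self, region, expected, new):
        """Return True only if the node existed in `expected` state and was updated."""
        if self.nodes.get(region) != expected:
            return False
        self.nodes[region] = new
        return True


def run_scenario(check_result):
    zk = FakeZk()
    region = "P1"

    zk.create_ephemeral(region, "SPLITTING")      # first split starts
    try:
        zk.create_ephemeral(region, "SPLITTING")  # second split of the same region
    except FileExistsError:
        zk.delete(region)                         # its rollback deletes the first split's node

    # First split finishes and tries SPLITTING -> SPLIT on a node that is gone.
    ok = zk.transition(region, "SPLITTING", "SPLIT")
    if check_result and not ok:
        return "split aborted: znode missing"
    return "split reported as successful"


if __name__ == "__main__":
    print(run_scenario(check_result=False))
    print(run_scenario(check_result=True))
```

With `check_result=False` the sketch mirrors the behavior described in the thread: the transition fails but nothing acts on it. Whether the actual fix tracked in HBASE-7103 checks the return value this way is an assumption on my part.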

On Tue, Nov 6, 2012 at 7:13 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Thanks Ram!
>
> ----- Original Message -----
> From: ramkrishna vasudevan <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Tuesday, November 6, 2012 3:29 AM
> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>
> Raised HBASE-7103 for the same.
>
> Regards
> Ram
>
> On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan <
> [EMAIL PROTECTED]> wrote:
>
> > Thanks for the logs.
> > I found the reason.
> >
> > The following steps happen
> > -> Initially the parent region P1 starts splitting.
> > -> The split is going on normally.
> > -> Another split starts at the same time for the same region P1. (Not
> > sure why this started).
> > -> Rollback happens seeing an already existing node.
> > -> This node gets deleted in rollback and nodeDeleted Event starts.
> > -> In nodeDeleted event the RIT for the region P1 gets deleted.
> > -> Because of this there is no region in RIT.
> > -> Now the first split finishes.  Here the problem is that we try to
> > transition the node from SPLITTING to SPLIT, but the node does not even
> > exist.  But we don't take any action on this.  We think it is successful.
> > -> Even before HBASE-6854 this could have happened.  Will file a JIRA for
> > the same.
> >
> > Regards
> > Ram
> >
> > On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> >
> >> Ram, here's the master log corresponding to
> >> http://pastebin.com/cSdMbA2a.
> >> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
> >> regionserver in question (node 169), so i'm guessing that's
> >> 22f8fa73d8af837410ca270f344f6bb8's mommy.
> >>
> >> btw - you can see my balancer kick in 45 seconds later (runs every 10
> >> minutes) here, but so far i think that's coincidence:
> >> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
> >> BalanceSwitch=false
> >>
> >> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169
> >> and found all this stuff about a failed split:
> >> http://pastebin.com/xtXMZ388 and
> >> an attempted rollback.  Looks like it errors out when it goes to put a
> >> node in ZK but it's already there.  I'm not familiar with what a good
> >> split log looks like, so i'll stop commenting for now...
> >>
> >>
> >> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
> >> [EMAIL PROTECTED]> wrote:
> >>
> >> > The log shows that even the first time the region was transitioned to
> >> > SPLITTING, it was not populated in the Master's memory.
> >> >
> >> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
> >> > [EMAIL PROTECTED]> wrote:
> >> >
> >> > > Could you attach the master logs at this time
> >> > > 2012-11-05 00:24:55?
> >> > >
> >> > > Regards
> >> > > Ram
> >> > >
> >> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED]>
> >> > > wrote:
> >> > >
> >> > >> Took a brief look through all SPLIT-related commits since 0.94.0...
> >> > >> Found these:
> >> > >>
> >> > >> HBASE-6854 *
> >> > >> HBASE-6713
> >> > >> HBASE-6329 *
> >> > >>
> >> > >> HBASE-6088
> >> > >>
> >> > >> HBASE-5986
> >> > >> HBASE-6070 *
> >> > >>
> >> > >>
> >> > >> The ones marked with * are (IMHO) more likely to be related.