Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.
Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
regionserver in question (node 169), so I'm guessing that's
22f8fa73d8af837410ca270f344f6bb8's parent region.

Btw, you can see my balancer kick in 45 seconds later (it runs every 10
minutes) here, but so far I think that's a coincidence:
2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster: BalanceSwitch=false

I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169
and found all this stuff about a failed split and an attempted rollback:
http://pastebin.com/xtXMZ388.  Looks like it errors out when it goes to put a
node in ZK but the node is already there.  I'm not familiar with what a good
split log looks like, so I'll stop commenting for now...
On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <[EMAIL PROTECTED]> wrote:

> The log shows that even the first time the region was transitioned to
> SPLITTING, it was not populated in the Master's memory.
>
> On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <[EMAIL PROTECTED]> wrote:
>
> > Could you attach the master logs for this time,
> > 2012-11-05 00:24:55?
> >
> > Regards
> > Ram
> >
> > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >> Took a brief look through all SPLIT related commits since 0.94.0... Found
> >> these:
> >>
> >> HBASE-6854 *
> >> HBASE-6713
> >> HBASE-6329 *
> >>
> >> HBASE-6088
> >>
> >> HBASE-5986
> >> HBASE-6070 *
> >>
> >>
> >> The ones marked with * are (IMHO) more likely to be related.
> >>
> >> -- Lars
> >>
> >> ________________________________
> >> From: Matt Corgan <[EMAIL PROTECTED]>
> >> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]>
> >> Sent: Monday, November 5, 2012 9:28 PM
> >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >>
> >> Yeah - we were running .94.0 since it came out but never saw it there.
> >> I'll keep trying to narrow it down.  The only harm it's causing is log
> >> spam and failing to move daughters to a new regionserver, which are
> >> definitely problems, but it's not bringing down the cluster.
> >>
> >>
> >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >> > So it seems you can repeat this to some extent in 0.94.2, but you have
> >> > never seen this before?
> >> >
> >> >
> >> > -- Lars
> >> >
> >> >
> >> >
> >> > ________________________________
> >> >  From: Matt Corgan <[EMAIL PROTECTED]>
> >> > To: dev <[EMAIL PROTECTED]>
> >> > Sent: Monday, November 5, 2012 9:10 PM
> >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >> >
> >> > It happened in this new table that I have all the logs for.  The region in
> >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace
> >> > of this region is on the regionserver with the "Instantiated tableName..."
> >> > message
> >> >
> >> > 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d7469784c21cbc2.
> >> >
> >> > I also know this region came from a recent split, but I can't find any log
> >> > messages showing the parent finishing the split that created this daughter
> >> > region.  So my guess now is that the split is actually finishing and
> >> > letting clients continue to write data, but something is failing to print
> >> > the log line and correctly tell the master about the new region.
> >> >
> >> > I've noticed that these regions are showing up on the clients in calls to
> >> > HTable.getRegionLocations(), so the clients continue to function, but if I
> >> > call HBaseAdmin.move() I get an UnknownRegionException.
> >> >
> >> >
> >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: