HBase, mail # user - RS crash upon replication


Re: RS crash upon replication
Jean-Daniel Cryans 2013-05-23, 16:48
fwiw stop_replication is a kill switch, not a general way to start and
stop replicating, and start_replication may put you in an inconsistent
state:

hbase(main):001:0> help 'stop_replication'
Stops all the replication features. The state in which each
stream stops in is undetermined.
WARNING:
start/stop replication is only meant to be used in critical load situations.

On Thu, May 23, 2013 at 1:17 AM, Amit Mor <[EMAIL PROTECTED]> wrote:
> No, the servers came out fine only because after the crash (of the RSs - the
> masters were still running) I immediately pulled the brakes with
> stop_replication. Then I started the RSs and they came back fine (not
> replicating). Once I hit 'start_replication' again they crashed for the
> second time. Eventually I deleted the heavily nested replication znodes and
> then 'start_replication' succeeded. I didn't patch 8207 because I'm on CDH
> with Cloudera Manager parcels, and I'm still trying to figure out how
> to replace their jars with mine in a clean and non-intrusive way.
>
>
> On Thu, May 23, 2013 at 10:33 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>
>> Actually, it seems like something else was wrong here - the servers came up
>> just fine on trying again - so could not really reproduce the issue.
>>
>> Amit: Did you try patching 8207 ?
>>
>> Varun
>>
>>
>> On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha <[EMAIL PROTECTED]> wrote:
>>
>> > That sounds like a bug for sure. Could you create a jira with logs/znode
>> > dump/steps to reproduce it?
>> >
>> > Thanks,
>> > himanshu
>> >
>> >
>> > On Wed, May 22, 2013 at 5:01 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>> >
>> > > It seems I can reproduce this - I did a few rolling restarts and got
>> > > screwed with NoNode exceptions - I am running 0.94.7, which has the fix,
>> > > but my nodes don't contain hyphens - nodes are no longer coming back up...
>> > >
>> > > Thanks
>> > > Varun
>> > >
>> > >
>> > > On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > I'd suggest patching the code with 8207; cdh4.2.1 doesn't have it.
>> > > >
>> > > > With hyphens in the name, ReplicationSource gets confused and tries to
>> > > > set data in a znode which doesn't exist.
>> > > >
>> > > > Thanks,
>> > > > Himanshu
>> > > >
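
To illustrate why hyphens in host names can be a problem here, the sketch below takes the recovered-queue znode name quoted later in this thread and splits it two ways. It is only an illustration in Python, not HBase's actual parsing code: the function names are made up, and it assumes the name has the shape "<peer id>-<server>-<server>-..." with each server written as "<host>,<port>,<startcode>" and a peer id that itself contains no hyphen.

```python
import re

# Recovered-queue znode name copied from the thread.
QUEUE = ("1-va-p-hbase-02-e,60020,1369042377129"
         "-va-p-hbase-02-c,60020,1369042377731"
         "-va-p-hbase-02-d,60020,1369233252475")


def naive_parse(queue_name):
    """Hypothetical parser that treats every hyphen as a separator."""
    peer_id, *servers = queue_name.split("-")
    return peer_id, servers


# Hypothetical alternative: anchor on the numeric port and startcode fields,
# so hyphens inside the host part are left alone.
SERVER = re.compile(r"([^,]+),(\d+),(\d+)(?:-|$)")


def anchored_parse(queue_name):
    """Split off the peer id, then match whole "<host>,<port>,<startcode>" names."""
    peer_id, _, rest = queue_name.partition("-")
    servers = [",".join(m.groups()) for m in SERVER.finditer(rest)]
    return peer_id, servers


if __name__ == "__main__":
    print(naive_parse(QUEUE))     # host names are chopped into fragments
    print(anchored_parse(QUEUE))  # three complete server names
```

With the naive split, none of the fragments corresponds to a real server name, which is consistent with the description above of a znode path being built that does not exist.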
>> > > >
>> > > > On Wed, May 22, 2013 at 2:42 PM, Amit Mor <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > > Yes, indeed - hyphens are part of the host name (annoying legacy stuff
>> > > > > in my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was
>> > > > > backported by Cloudera into their flavor of 0.94.2, but
>> > > > > the mysterious occurrence of the percent sign in zkcli (ls
>> > > > > /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
>> > > > > might be a sign of such a problem. How deep should my rmr in zkcli
>> > > > > be (an example would be most welcome :)? I have no serious problem
>> > > > > running copyTable with a time period corresponding to the outage and then
>> > > > > starting the sync back again. One question though: how did it cause a crash?
>> > > > >
>> > > > >
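
On the percent signs in that listing: %2C is the URL-encoded form of a comma, so the last path element looks like a URL-encoded server name followed by a timestamp, not corruption. The sketch below decodes it in Python; reading the trailing number as the WAL's timestamp is an assumption based on the name alone, and the variable names are illustrative.

```python
from urllib.parse import unquote

# Last path element of the znode listed in the thread.
wal_znode = "va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895"

# Undo the URL encoding: "%2C" becomes ",".
decoded = unquote(wal_znode)

# Assumed layout: "<host>,<port>,<startcode>.<timestamp>".
server_name, _, wal_timestamp = decoded.rpartition(".")

if __name__ == "__main__":
    print(decoded)
    print(server_name)
    print(wal_timestamp)
```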
>> > > > > On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > > I believe there were cascading failures which produced these deep nodes
>> > > > > > containing still-to-be-replicated WAL(s) - I suspect there is either some
>> > > > > > parsing bug or something else which is causing the replication source to
>> > > > > > not work - also, which version are you using - does it have
>> > > > > > https://issues.apache.org/jira/browse/HBASE-8207 - since you use hyphens
>> > > > > > in your paths? One way to get back up is to delete these nodes but