HBase >> mail # user >> RS crash upon replication


Re: RS crash upon replication
fwiw stop_replication is a kill switch, not a general way to start and
stop replicating, and start_replication may put you in an inconsistent
state:

hbase(main):001:0> help 'stop_replication'
Stops all the replication features. The state in which each
stream stops in is undetermined.
WARNING:
start/stop replication is only meant to be used in critical load situations.
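
For readers following the hyphen discussion quoted below, here is a minimal
sketch of why a queue znode name such as the one in Amit's zkcli listing is
hard to split when host names contain hyphens. It is not HBase's actual
parsing code; the function names are hypothetical, and it assumes that a
server name has the form host,port,startcode and that the peer id itself
contains no hyphen. It also shows that %2C in the log file name is simply a
URL-encoded comma.

```python
import re
from urllib.parse import unquote

# Assumption: a server name looks like "host,port,startcode", as in the
# znode path quoted in this thread. The host part may contain hyphens.
SERVER_NAME = re.compile(r"([^,]+,\d+,\d+)(?:-|$)")


def naive_split(queue_id):
    """Split on every hyphen. Breaks as soon as a host name has a hyphen."""
    return queue_id.split("-")


def parse_queue_id(queue_id):
    """Return (peer_id, [server names]) for a recovered-queue znode name.

    Hypothetical helper: assumes the peer id has no hyphen and each
    following chunk is a host,port,startcode triple.
    """
    peer_id, _, rest = queue_id.partition("-")
    servers = []
    while rest:
        match = SERVER_NAME.match(rest)
        if match is None:
            raise ValueError("unexpected queue id fragment: " + rest)
        servers.append(match.group(1))
        rest = rest[match.end():]
    return peer_id, servers


# Queue znode name and log file name copied from the path in the thread.
QUEUE_ID = (
    "1-va-p-hbase-02-e,60020,1369042377129"
    "-va-p-hbase-02-c,60020,1369042377731"
    "-va-p-hbase-02-d,60020,1369233252475"
)
LOG_NAME = "va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895"

if __name__ == "__main__":
    print(len(naive_split(QUEUE_ID)), "pieces from the naive split")
    print(parse_queue_id(QUEUE_ID))
    print(unquote(LOG_NAME))
```

The naive split cuts the host names apart, so any code that rebuilds a znode
path from those pieces would point at a node that does not exist, which is
consistent with the NoNode errors described below. The second function only
works because it relies on the comma-separated port and start code to find
where each server name ends.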

On Thu, May 23, 2013 at 1:17 AM, Amit Mor <[EMAIL PROTECTED]> wrote:
> No, the servers came out fine just because after the crash (RS's - the
> masters were still running), I immediately pulled the brakes with
> stop_replication. Then I started the RS's and they came back fine (not
> replicating). Once I hit 'start_replication' again they crashed for the
> second time. Eventually I deleted the heavily nested replication znodes and
> the 'start_replication' succeeded. I didn't patch 8207 because I'm on CDH
> with the Cloudera Manager Parcels thing and I'm still trying to figure out how
> to replace their jars with mine in a clean and non-intrusive way.
>
>
> On Thu, May 23, 2013 at 10:33 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>
>> Actually, it seems like something else was wrong here - the servers came up
>> just fine on trying again - so could not really reproduce the issue.
>>
>> Amit: Did you try patching 8207 ?
>>
>> Varun
>>
>>
>> On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha <[EMAIL PROTECTED]> wrote:
>>
>> > That sounds like a bug for sure. Could you create a jira with logs/znode
>> > dump/steps to reproduce it?
>> >
>> > Thanks,
>> > himanshu
>> >
>> >
>> > On Wed, May 22, 2013 at 5:01 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>> >
>> > > It seems I can reproduce this - I did a few rolling restarts and got
>> > > screwed with NoNode exceptions - I am running 0.94.7 which has the fix but
>> > > my nodes don't contain hyphens - nodes are no longer coming back up...
>> > >
>> > > Thanks
>> > > Varun
>> > >
>> > >
>> > > On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > I'd suggest patching the code with 8207; cdh4.2.1 doesn't have it.
>> > > >
>> > > > With hyphens in the name, ReplicationSource gets confused and tries to
>> > > > set data in a znode which doesn't exist.
>> > > >
>> > > > Thanks,
>> > > > Himanshu
>> > > >
>> > > >
>> > > > On Wed, May 22, 2013 at 2:42 PM, Amit Mor <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > > Yes, indeed - hyphens are part of the host name (annoying legacy stuff in
>> > > > > my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was
>> > > > > backported by Cloudera into their flavor of 0.94.2, but the mysterious
>> > > > > occurrence of the percent sign in zkcli (ls
>> > > > > /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
>> > > > > might be a sign of such a problem. How deep should my rmr in zkcli be (an
>> > > > > example would be most welcome :) ? I have no serious problem running
>> > > > > copyTable with a time period corresponding to the outage and then
>> > > > > starting the sync back again. One question though: how did it cause a
>> > > > > crash?
>> > > > >
>> > > > >
>> > > > > On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>> > > > >
>> > > > > > I believe there were cascading failures which got these deep nodes
>> > > > > > containing still-to-be-replicated WAL(s) - I suspect there is either some
>> > > > > > parsing bug or something which is causing the replication source to not
>> > > > > > work - also which version are you using - does it have
>> > > > > > https://issues.apache.org/jira/browse/HBASE-8207 - since you use hyphens
>> > > > > > in your paths. One way to get back up is to delete these nodes but