HBase, mail # user - RS crash upon replication


Re: RS crash upon replication
Himanshu Vashishtha 2013-05-23, 00:40
That sounds like a bug for sure. Could you create a jira with logs/znode
dump/steps to reproduce it?

Thanks,
himanshu
On Wed, May 22, 2013 at 5:01 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> It seems I can reproduce this - I did a few rolling restarts and got
> screwed with NoNode exceptions. I am running 0.94.7, which has the fix, and
> my node names don't contain hyphens, yet nodes are no longer coming back up...
>
> Thanks
> Varun
>
>
> On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha <[EMAIL PROTECTED]> wrote:
>
> > I'd suggest patching the code with HBASE-8207; cdh4.2.1 doesn't have it.
> >
> > With hyphens in the name, ReplicationSource gets confused and tries to set
> > data in a znode which doesn't exist.
> >
> > Thanks,
> > Himanshu
> >
> >
> > On Wed, May 22, 2013 at 2:42 PM, Amit Mor <[EMAIL PROTECTED]> wrote:
> >
> > > Yes, indeed - hyphens are part of the host name (annoying legacy stuff in
> > > my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was
> > > backported by Cloudera into their flavor of 0.94.2, but the mysterious
> > > occurrence of the percent sign in zkcli
> > > (ls /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
> > > might be a sign of such a problem. How deep should my rmr in zkcli be (an
> > > example would be most welcome :) ? I have no serious problem running
> > > copyTable with a time period corresponding to the outage and then
> > > starting the sync back again. One question though: how did it cause a
> > > crash?
> > >
> > >
> > > On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> > >
> > > > I believe there were cascading failures which produced these deep nodes
> > > > containing still-to-be-replicated WAL(s). I suspect there is a parsing
> > > > bug or something similar which is causing the replication source to not
> > > > work. Also, which version are you using - does it have
> > > > https://issues.apache.org/jira/browse/HBASE-8207 - since you use hyphens
> > > > in your paths? One way to get back up is to delete these nodes, but then
> > > > you lose the data in these WAL(s)...
> > > >
> > > >
> > > > On Wed, May 22, 2013 at 2:22 PM, Amit Mor <[EMAIL PROTECTED]> wrote:
> > > >
> > > > >  va-p-hbase-02-d,60020,1369249862401
> > > > >
> > > > >
> > > > > On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Basically
> > > > > >
> > > > > > ls /hbase/rs and what do you see for va-p-02-d ?
> > > > > >
> > > > > >
> > > > > > On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > Can you do ls /hbase/rs and see what you get for 02-d? Instead of
> > > > > > > looking in /replication/, could you look in /hbase/replication/rs?
> > > > > > > I want to see whether the timestamps match.
> > > > > > >
> > > > > > > Varun
> > > > > > >
> > > > > > >
> > > > > > > On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > >> I see - so it looks okay - there's just a lot of deep nesting in
> > > > > > > >> there. If you look into these nodes by doing ls, you should see a
> > > > > > > >> bunch of WAL(s) which still need to be replicated...
> > > > > > >>
> > > > > > >> Varun
> > > > > > >>
> > > > > > >>
> > > > > > > >> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> > > > > > >>
> > > > > > > >>> 2013-05-22 15:31:25,929 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception:
> > > > > > > >>> org.apache.zookeeper.KeeperException$SessionExpiredException:
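
On the percent sign seen in zkcli: %2C is simply the URL-encoded form of a comma, so the WAL znode name in this thread looks like a region server name (host,port,startcode) with its commas encoded, followed by a dot and a timestamp. Below is a minimal Python sketch, assuming that layout, for decoding such names and for checking whether two server names from /hbase/rs and /hbase/replication/rs carry the same start code. The helper names are made up for illustration and are not HBase APIs.

```python
from urllib.parse import unquote


def parse_server_name(name):
    """Split 'host,port,startcode' from the right, so hyphens in host are harmless."""
    host, port, start_code = name.rsplit(",", 2)
    return host, int(port), int(start_code)


def decode_wal_znode(znode):
    """Decode a WAL znode assumed to look like 'host%2Cport%2Cstartcode.timestamp'."""
    server, _, wal_ts = unquote(znode).rpartition(".")
    return parse_server_name(server), int(wal_ts)


def same_incarnation(a, b):
    """True only if host, port and start code all match."""
    return parse_server_name(a) == parse_server_name(b)


# Names copied from the thread.
live = "va-p-hbase-02-d,60020,1369249862401"   # reported for /hbase/rs
old = "va-p-hbase-02-d,60020,1369233252475"    # appears inside the queue znode name
wal = "va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895"

print(decode_wal_znode(wal))
print(same_incarnation(live, old))  # same host, different start code
```

In this data the host va-p-hbase-02-d shows up with two different start codes, which would be consistent with the server having been restarted between the two registrations.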
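
To illustrate why hyphens in host names can confuse a parser of the replication queue znode, here is a small Python sketch. It assumes, based on the path quoted in this thread, that a recovered queue name is a peer id followed by dead region server names joined with hyphens. It is not the HBASE-8207 patch and not HBase code; both functions are hypothetical and only show the failure mode and one possible hyphen-tolerant approach.

```python
import re

# Queue znode name copied from the path quoted in the thread.
QUEUE = ("1-va-p-hbase-02-e,60020,1369042377129"
         "-va-p-hbase-02-c,60020,1369042377731"
         "-va-p-hbase-02-d,60020,1369233252475")


def naive_parse(queue):
    """Split on every hyphen: breaks as soon as a host name contains one."""
    peer_id, *servers = queue.split("-")
    return peer_id, servers


def hyphen_safe_parse(queue):
    """Treat each server as '<anything>,<port>,<startcode>' ended by '-' or end of string."""
    peer_id, rest = queue.split("-", 1)  # assumes the peer id itself has no hyphen
    servers = re.findall(r"(.+?,\d+,\d+)(?:-|$)", rest)
    return peer_id, servers


print(naive_parse(QUEUE)[1][:3])   # fragments such as 'va', 'p', 'hbase'
print(hyphen_safe_parse(QUEUE)[1])  # three complete server names
```

With the naive split, the host name va-p-hbase-02-e is chopped into fragments that do not correspond to any znode, which matches the description of ReplicationSource trying to set data in a znode that doesn't exist.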