HBase >> mail # user >> RS crash upon replication


Re: RS crash upon replication
2013-05-22 15:31:25,929 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719

01->[01->02->02]->01

Looks like a bunch of cascading failures caused this deep nesting...
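The queue znode name in the log above encodes the history read here as `01->[01->02->02]->01`. Below is a minimal sketch that decodes such names. It assumes (from the names in this thread, not from HBase source) that a queue id is the peer id followed by `-<host,port,startcode>` for each server the queue was handed over from, and that an HLog file name is a URL-encoded server name, a dot, and a timestamp. Because these hostnames contain `-`, the sketch anchors each server name on its `,port,startcode` suffix instead of splitting on `-`. `ServerName`, `parse_queue_id` and `parse_log_name` are hypothetical helpers.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple, Tuple
from urllib.parse import unquote

# "host,port,startcode"; the host part may itself contain "-".
SERVER = r"([^,]+),(\d+),(\d+)"
SERVER_RE = re.compile(SERVER)
# In a queue id, server names are joined by "-", so anchor each one on its
# trailing ",port,startcode" digits instead of splitting naively on "-".
CHAIN_RE = re.compile(SERVER + r"(?:-|$)")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class ServerName(NamedTuple):
    host: str
    port: int
    startcode: int  # assumed: region server start time, epoch milliseconds

    def started_at(self) -> datetime:
        # Integer milliseconds -> exact UTC datetime (no float rounding).
        return EPOCH + timedelta(milliseconds=self.startcode)


def parse_queue_id(queue_id: str) -> Tuple[str, List[ServerName]]:
    """Split a queue id into (peer id, servers the queue passed through).

    Assumes the peer id contains no "-" and that every hand-over appended
    "-<dead server name>" to the id.
    """
    peer_id, _, rest = queue_id.partition("-")
    servers = [ServerName(h, int(p), int(s)) for h, p, s in CHAIN_RE.findall(rest)]
    return peer_id, servers


def parse_log_name(name: str) -> Tuple[ServerName, int]:
    """Decode "<URL-encoded server name>.<timestamp>" into its parts."""
    server, _, ts = unquote(name).rpartition(".")
    m = SERVER_RE.fullmatch(server)
    if m is None:
        raise ValueError(f"unexpected log name: {name!r}")
    h, p, s = m.groups()
    return ServerName(h, int(p), int(s)), int(ts)
```

On the names above, `parse_queue_id` yields peer `1` and the chain `va-p-hbase-01-c` (old startcode), `va-p-hbase-02-c`, `va-p-hbase-02-d`, and the HLog name decodes to the old 01 instance. If startcodes are indeed start times in milliseconds, the current 01 instance (`1369249873379`) started at 2013-05-22 19:11:13 UTC, i.e. 15:11:13 EDT, a few seconds before the `ctime` of its `/1` znode reported further down.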
On Wed, May 22, 2013 at 2:09 PM, Amit Mor <[EMAIL PROTECTED]> wrote:

> empty return:
>
> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> []
>
>
>
> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <[EMAIL PROTECTED]>
> wrote:
>
> > Do an "ls", not a "get", here and give the output?
> >
> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >
> >
> > On Wed, May 22, 2013 at 1:53 PM, [EMAIL PROTECTED] <
> > [EMAIL PROTECTED]> wrote:
> >
> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > >
> > > cZxid = 0x60281c1de
> > > ctime = Wed May 22 15:11:17 EDT 2013
> > > mZxid = 0x60281c1de
> > > mtime = Wed May 22 15:11:17 EDT 2013
> > > pZxid = 0x60281c1de
> > > cversion = 0
> > > dataVersion = 0
> > > aclVersion = 0
> > > ephemeralOwner = 0x0
> > > dataLength = 0
> > > numChildren = 0
> > >
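In the ZooKeeper CLI, `get` prints a znode's data plus its stat, while `ls` lists its children. The `get` output above shows `/1` as an empty persistent node (`dataLength = 0`, `ephemeralOwner = 0x0`, `numChildren = 0`), and `ls` confirms it has no children, suggesting no HLogs are queued for peer 1. Below is a minimal sketch of the same `ls` walk done over every region server at once. It assumes the `/hbase/replication/rs/<server>/<queue id>/<hlog>` layout seen in this thread and a client exposing `get_children(path)` (kazoo's `KazooClient` has a method by that name); `FakeZk` is hypothetical scaffolding so the sketch runs without a ZooKeeper ensemble.

```python
from typing import Dict, List, Protocol

# Layout observed in this thread: /hbase/replication/rs/<server>/<queue id>/<hlog>
REPLICATION_RS = "/hbase/replication/rs"

Tree = Dict[str, Dict[str, List[str]]]


class ChildLister(Protocol):
    """Anything with get_children(path), e.g. kazoo's KazooClient."""

    def get_children(self, path: str) -> List[str]: ...


def replication_queues(zk: ChildLister, root: str = REPLICATION_RS) -> Tree:
    """Recursive 'ls': {region server: {queue id: [hlog names]}}."""
    tree: Tree = {}
    for server in zk.get_children(root):
        queues: Dict[str, List[str]] = {}
        for queue in zk.get_children(f"{root}/{server}"):
            queues[queue] = sorted(zk.get_children(f"{root}/{server}/{queue}"))
        tree[server] = queues
    return tree


class FakeZk:
    """Hypothetical in-memory stand-in: maps a znode path to its child names."""

    def __init__(self, children: Dict[str, List[str]]):
        self.children = children

    def get_children(self, path: str) -> List[str]:
        return list(self.children.get(path, []))


# The state reported in this thread: one peer queue "1" with no HLogs under it.
zk = FakeZk({
    REPLICATION_RS: ["va-p-hbase-01-c,60020,1369249873379"],
    REPLICATION_RS + "/va-p-hbase-01-c,60020,1369249873379": ["1"],
})
print(replication_queues(zk))
# {'va-p-hbase-01-c,60020,1369249873379': {'1': []}}
```

Against a live cluster the walk would also need to tolerate znodes disappearing between calls (kazoo raises `kazoo.exceptions.NoNodeError`), which this sketch does not handle.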
> > >
> > >
> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > What does this command show you?
> > > >
> > > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > > >
> > > > Cheers
> > > >
> > > > On Wed, May 22, 2013 at 1:46 PM, [EMAIL PROTECTED] <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> > > > > [1]
> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > > > > []
> > > > >
> > > > > I'm on hbase-0.94.2-cdh4.2.1
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <[EMAIL PROTECTED]>
> > > > > wrote:
> > > > >
> > > > > > Also, what version of HBase are you running?
> > > > > >
> > > > > >
> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <[EMAIL PROTECTED]>
> > > > > > wrote:
> > > > > >
> > > > > > > Basically,
> > > > > > >
> > > > > > > You had va-p-hbase-02 crash, which caused all the replication-related
> > > > > > > data in ZooKeeper to be moved to va-p-hbase-01 and made 01 take over
> > > > > > > replicating 02's logs. Now, each region server also maintains an
> > > > > > > in-memory state of what's in ZK. It seems that when you start up 01,
> > > > > > > it's trying to replicate the 02 logs underneath, but it's failing to
> > > > > > > because that data is not in ZK. This is somewhat weird...
> > > > > > >
> > > > > > > Can you open the ZooKeeper shell and run
> > > > > > >
> > > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> > > > > > >
> > > > > > > and give the output?
> > > > > > >
> > > > > > >
> > > > > > > On Wed, May 22, 2013 at 1:27 PM, [EMAIL PROTECTED] <
> > > > > > > [EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > >> Hi,
> > > > > > >>
> > > > > > > >> This is bad... and it happened twice. I had taken my replication-slave
> > > > > > > >> cluster offline and performed quite a massive Merge operation on it.
> > > > > > > >> After a couple of hours it had finished and I brought it back online.
> > > > > > > >> At the same time, the replication-master RS machines crashed (see the
> > > > > > > >> first crash at http://pastebin.com/1msNZ2tH), with the first exception
> > > > > > > >> being:
> > > > > > > >>
> > > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: