HBase >> mail # dev >> Replication hosed after simple cluster restart


Re: Replication hosed after simple cluster restart
Yeah, lemme sink the RC... We do have a fix.
Consider it sunk.

In the end there are some more issues to discuss anyway.
- Can we avoid RSs taking over queues during a clean shutdown/restart? Without multi we can actually lose data to replicate this way (one RS is shut down, another takes over and is itself shut down) - unless I misunderstand.
- Should we stagger the attempts to move the queues, for example with a random wait between 0 and 10s, so that not all RSs try at the same time?
- A test for this scenario? (That's probably tricky)
-- Lars
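
For illustration only, the staggering idea from the second bullet above could look roughly like the sketch below. It is not taken from the HBase code base: the class name, the method names, and the Runnable standing in for the actual queue takeover are all hypothetical. The only thing taken from the thread is the 0 to 10s window.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch, not HBase code: wait a random amount of time
// before attempting to claim a dead region server's replication queues.
public class StaggeredQueueClaim {
    // Upper bound of the random wait, from the 0-10s suggestion in the thread.
    static final long MAX_STAGGER_MS = 10_000L;

    // Pick a uniformly random delay in [0, maxMs] milliseconds.
    static long pickStaggerMs(long maxMs) {
        return ThreadLocalRandom.current().nextLong(maxMs + 1);
    }

    // Sleep for the random delay, then run the claim action.
    // claimQueues is a placeholder for whatever moves the queues.
    static long claimAfterStagger(long maxMs, Runnable claimQueues)
            throws InterruptedException {
        long delay = pickStaggerMs(maxMs);
        Thread.sleep(delay);
        claimQueues.run();
        return delay;
    }

    public static void main(String[] args) throws InterruptedException {
        // A short 50 ms window keeps the demo quick; a real caller
        // would pass MAX_STAGGER_MS.
        long waited = claimAfterStagger(50L,
                () -> System.out.println("claiming queues (placeholder)"));
        System.out.println("waited " + waited + " ms");
    }
}
```

A random wait only makes it less likely that all RSs race for the same queues at once. It does not by itself address the takeover-during-clean-shutdown concern in the first bullet.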

________________________________
 From: Andrew Purtell <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Wednesday, March 13, 2013 8:22 PM
Subject: Re: Replication hosed after simple cluster restart
 
If Himanshu (?) can fix it quickly we should try to get it in here IMHO.

On Wednesday, March 13, 2013, Ted Yu wrote:

> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
> HBASE-2611 Handle RS that fails while processing the failure of another one
> (Himanshu Vashishtha)
>
> It went into 0.94.5
> And the feature is off by default:
>
>     <name>hbase.zookeeper.useMulti</name>
>     <value>false</value>
>
> The fact that Lars first reported the following problem meant that no other
> user tried this feature.
>
> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>
> Cheers
>
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <[EMAIL PROTECTED]>
> wrote:
>
> > Hey no problem. It's cool that we found it in a test env. It's probably
> > quite hard to reproduce.
> > This is in 0.94.5 but this feature is off by default.
> >
> > What's the general thought here, should I kill the current 0.94.6 rc for
> > this?
> > My gut says: Yes.
> >
> >
> > I'm also a bit worried about these:
> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
> > java.io.EOFException
> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
> >         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
> > 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
> >
> > This happens after bouncing the cluster a 2nd time and these messages
> > repeat every 10s (for hours now). This is a separate problem I think.
> >
> > -- Lars
> >
> >   ------------------------------
> > *From:* Himanshu Vashishtha <[EMAIL PROTECTED]>
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)