HBase >> mail # dev >> Replication hosed after simple cluster restart


Re: Replication hosed after simple cluster restart
I have proposed some minor changes (including adding the jitter) on HBASE-8099.
Turns out there already is a wait-time to give the cluster a chance to shutdown. It defaults to 2s, which was not enough in our case.
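
(For what it's worth, the 2s wait mentioned above looks like the configurable sleep before queue failover. Assuming the key is replication.sleep.before.failover (not confirmed in this thread), raising it would be a plain hbase-site.xml override:

    <property>
      <name>replication.sleep.before.failover</name>
      <!-- illustrative value: 30s instead of the 2s default -->
      <value>30000</value>
    </property>
)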

Let's do a test (if we think that can be done) in a different jira.
-- Lars
________________________________
From: Himanshu Vashishtha <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
Sent: Wednesday, March 13, 2013 8:59 PM
Subject: Re: Replication hosed after simple cluster restart

On Wed, Mar 13, 2013 at 8:48 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> Yeah, lemme sink the RC... We do have a fix.
>
>
> Consider it sunk.
>
> In the end there are some more issues to discuss anyway.
> - Can we avoid RSs taking over queues during a clean shutdown/restart? Without multi we can actually lose data to replicate this way (one RS is shut down, another takes over and is itself shut down) - unless I misunderstand.

I agree, because even if the queues do move, they lose locality: the
region server which eventually takes them over will read the log
files remotely. One way I can think of is to scan the available
region servers in the /hbase/rs znodes and then decide whether to
start the failover processing.
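
A rough sketch of that check, using the plain ZooKeeper client; the znode path and the threshold are assumptions for illustration, not what HBase currently does:

    // Sketch only: gate the queue failover on how many region servers are
    // still registered under /hbase/rs (a mass disappearance suggests a
    // cluster-wide shutdown rather than a real failure).
    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public final class FailoverGate {
      // Returns true if enough live region servers remain for a queue
      // takeover to make sense; false would mean skipping the failover.
      static boolean shouldStartFailover(ZooKeeper zk, int minLiveServers)
          throws KeeperException, InterruptedException {
        List<String> live = zk.getChildren("/hbase/rs", false); // no watch
        return live.size() >= minLiveServers;
      }
    }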

>
> - Should we stagger the attempts to move the queues for example with a random wait between 0 and 10s, so that not all RSs try at the same time?
> - A test for this scenario? (That's probably tricky)

How about adding jitter (a random sleep of 0-10 sec) in the run method
of the NodeFailoverWorker before it actually starts the failover
processing? I will try to come up with a test case.
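
Something along these lines at the top of the worker's run() is presumably what is meant; this is only a sketch, and the 10s bound is the number floated above:

    // Sketch: random jitter before the failover processing, so that all
    // region servers do not try to grab the queues at the same moment.
    public void run() {
      int jitterMs = new java.util.Random().nextInt(10000); // 0-10s
      try {
        Thread.sleep(jitterMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // ... existing failover processing follows here ...
    }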

>
>
> -- Lars
>
>
>
> ________________________________
>  From: Andrew Purtell <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Wednesday, March 13, 2013 8:22 PM
> Subject: Re: Replication hosed after simple cluster restart
>
> If Himanshu (?) can fix it quickly we should try to get it in here IMHO.
>
> On Wednesday, March 13, 2013, Ted Yu wrote:
>
>> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
>> HBASE-2611 Handle RS that fails while processing the failure of another one
>> (Himanshu Vashishtha)
>>
>> It went into 0.94.5
>> And the feature is off by default:
>>
>>     <name>hbase.zookeeper.useMulti</name>
>>     <value>false</value>
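
(Enabling it is, as far as I know, the usual hbase-site.xml override, and the multi call needs a ZooKeeper 3.4+ ensemble:

    <property>
      <name>hbase.zookeeper.useMulti</name>
      <value>true</value>
    </property>
)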
>>
>> The fact that Lars was the first to report the following problem suggests
>> that no other user has tried this feature.
>>
>> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>>
>> Cheers
>>
>> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <[EMAIL PROTECTED]>
>> wrote:
>>
>> > Hey no problem. It's cool that we found it in a test env. It's probably
>> > quite hard to reproduce.
>> > This is in 0.94.5 but this feature is off by default.
>> >
>> > What's the general thought here, should I kill the current 0.94.6 rc for
>> > this?
>> > My gut says: Yes.
>> >
>> >
>> > I'm also a bit worried about these:
>> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got: java.io.EOFException
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
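
For what it's worth, that EOFException pattern (readFully failing inside SequenceFile.Reader.init) is typically what one sees when the reader opens a zero-length or still-being-written WAL. A minimal, HBase-free way to reproduce the symptom (file name is arbitrary):

    // Sketch: opening an empty SequenceFile fails in init() with
    // java.io.EOFException, matching the stack above.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class EmptyWalRepro {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path p = new Path("/tmp/empty-wal");
        fs.create(p).close();                  // zero-length file
        new SequenceFile.Reader(fs, p, conf);  // throws EOFException
      }
    }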