HBase dev mailing list: Replication hosed after simple cluster restart


Thread:
- lars hofhansl  2013-03-14, 01:12
- lars hofhansl  2013-03-14, 01:22
- Stack  2013-03-14, 01:43
- lars hofhansl  2013-03-14, 01:52
- Ted Yu  2013-03-14, 01:17
- lars hofhansl  2013-03-14, 01:27
- Himanshu Vashishtha  2013-03-14, 01:38
- lars hofhansl  2013-03-14, 01:45
- Ted Yu  2013-03-14, 03:06
- Andrew Purtell  2013-03-14, 03:22
- Himanshu Vashishtha  2013-03-14, 03:26
- lars hofhansl  2013-03-14, 03:48
- Himanshu Vashishtha  2013-03-14, 03:59
- lars hofhansl  2013-03-14, 04:16
Re: Replication hosed after simple cluster restart
I agree that we should try to add a test for this. It was by chance that Lars
found it; it would be even better if the test suite could detect such things.

Has anyone opened a JIRA for the test part?

JM

2013/3/14 lars hofhansl <[EMAIL PROTECTED]>:
> I have proposed some minor changes (including adding the jitter) on HBASE-8099.
> Turns out there already is a wait-time to give the cluster a chance to shut down. It defaults to 2s, which was not enough in our case.
>
> Let's do a test (if we think that can be done) in a different jira.
>
>
> -- Lars
> ________________________________
> From: Himanshu Vashishtha <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Sent: Wednesday, March 13, 2013 8:59 PM
> Subject: Re: Replication hosed after simple cluster restart
>
> On Wed, Mar 13, 2013 at 8:48 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>> Yeah, lemme sink the RC... We do have a fix.
>>
>>
>> Consider it sunk.
>>
>> In the end there are some more issues to discuss anyway.
>> - Can we avoid RSs taking over queues during a clean shutdown/restart? Without multi we can actually lose data to replicate this way (one RS is shut down, another takes over and is itself shut down) - unless I misunderstand.
>
> I agree, because even if the queues do move, they are not using locality: the
> region server which eventually takes a queue over will read the log files
> remotely. One way I can think of is to scan the available region servers
> under the /hbase/rs znodes and then decide whether to start the failover
> processing.
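
A minimal sketch of that idea, assuming a plain ZooKeeper client and the default /hbase/rs path; the class and method names below are illustrative, not the actual HBase code:

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative sketch only; not the real ReplicationSourceManager logic.
    public class QueueTakeoverGuard {

        private final ZooKeeper zk;
        private final String rsZNode; // e.g. "/hbase/rs", where live region servers register

        public QueueTakeoverGuard(ZooKeeper zk, String rsZNode) {
            this.zk = zk;
            this.rsZNode = rsZNode;
        }

        // Claim the dead server's replication queue only if no znode for it is
        // still registered, i.e. the server is really gone, not just restarting.
        public boolean shouldTakeOverQueue(String serverName)
                throws KeeperException, InterruptedException {
            List<String> liveServers = zk.getChildren(rsZNode, false);
            return !liveServers.contains(serverName);
        }
    }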
>
>>
>> - Should we stagger the attempts to move the queues for example with a random wait between 0 and 10s, so that not all RSs try at the same time?
>> - A test for this scenario? (That's probably tricky)
>
> How about adding a jitter (random sleep (0-10 sec]) in the run method
> of the NodeFailoverWorker before it actually starts the failover
> processing? I will try to come up with a test case.
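
A rough sketch of that jitter, assuming NodeFailoverWorker runs as a Runnable; the class and constant names here are made up for illustration:

    import java.util.Random;

    // Illustrative sketch of the proposed jitter; not the actual NodeFailoverWorker.
    public class JitteredFailoverWorker implements Runnable {

        private static final int MAX_JITTER_MS = 10000; // random sleep of up to 10s
        private final Random random = new Random();

        @Override
        public void run() {
            try {
                // Stagger the start so all region servers do not race to claim the
                // dead server's replication queues at the same moment.
                Thread.sleep(random.nextInt(MAX_JITTER_MS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            // ... the actual failover processing (copying the queues) would follow ...
        }
    }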
>
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Andrew Purtell <[EMAIL PROTECTED]>
>> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>> Sent: Wednesday, March 13, 2013 8:22 PM
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> If Himanshu (?) can fix it quickly we should try to get it in here IMHO.
>>
>> On Wednesday, March 13, 2013, Ted Yu wrote:
>>
>>> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
>>> HBASE-2611 Handle RS that fails while processing the failure of another one
>>> (Himanshu Vashishtha)
>>>
>>> It went into 0.94.5
>>> And the feature is off by default:
>>>
>>>     <name>hbase.zookeeper.useMulti</name>
>>>     <value>false</value>
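
For reference, a minimal way to flip that flag on in code, e.g. for a test, using the standard HBase Configuration API (in a deployment it would normally be set in hbase-site.xml instead); the class name below is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    // Sketch only: enable the multi-based replication queue failover,
    // which is off by default in 0.94.5.
    public class UseMultiConfig {
        public static Configuration withMultiEnabled() {
            Configuration conf = HBaseConfiguration.create();
            conf.setBoolean("hbase.zookeeper.useMulti", true);
            return conf;
        }
    }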
>>>
>>> The fact that Lars was the first to report this problem suggests that no
>>> other user has tried this feature.
>>>
>>> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>>>
>>> Cheers
>>>
>>> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>>
>>> > Hey no problem. It's cool that we found it in a test env. It's probably
>>> > quite hard to reproduce.
>>> > This is in 0.94.5 but this feature is off by default.
>>> >
>>> > What's the general thought here, should I kill the current 0.94.6 rc for
>>> > this?
>>> > My gut says: Yes.
>>> >
>>> >
>>> > I'm also a bit worried about these:
>>> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got: java.io.EOFException
>>> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>> >         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>> >         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
Other messages in this thread:
- Himanshu Vashishtha  2013-03-14, 03:13
- Himanshu Vashishtha  2013-03-14, 01:51
- Himanshu Vashishtha  2013-03-14, 01:52