

lars hofhansl 2013-03-14, 01:12
lars hofhansl 2013-03-14, 01:22
Stack 2013-03-14, 01:43
lars hofhansl 2013-03-14, 01:52
Ted Yu 2013-03-14, 01:17
lars hofhansl 2013-03-14, 01:27
Himanshu Vashishtha 2013-03-14, 01:38
lars hofhansl 2013-03-14, 01:45
Ted Yu 2013-03-14, 03:06
Andrew Purtell 2013-03-14, 03:22
Himanshu Vashishtha 2013-03-14, 03:26
lars hofhansl 2013-03-14, 03:48
Re: Replication hosed after simple cluster restart
On Wed, Mar 13, 2013 at 8:48 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> Yeah, lemme sink the RC... We do have a fix.
>
>
> Consider it sunk.
>
> In the end there are some more issues to discuss anyway.
> - Can we avoid RSs taking over queues during a clean shutdown/restart? Without multi we can actually lose data to replicate this way (one RS is shut down, another takes over and is itself shut down) - unless I misunderstand.

I agree, because even if the queues do move, there is no locality benefit: the
regionserver that eventually takes them over will read the log files remotely.
One way I can think of is to scan the available regionservers under the
/hbase/rs znodes and decide from that whether failover processing should start
at all.
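
As a rough sketch of that check (an illustration only: the znode path, the decision rule, and the class and method names below are assumptions, not the actual ReplicationSourceManager code), a region server could list the ephemeral znodes under /hbase/rs and only claim a dead peer's queues if that peer is really gone and other region servers are still alive:

import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch of the /hbase/rs scan idea described above.
public class FailoverCheckSketch {

  private static final String RS_ZNODE = "/hbase/rs"; // assumes the default znode parent

  public static boolean shouldStartFailover(ZooKeeper zk, String deadServerName)
      throws KeeperException, InterruptedException {
    // Ephemeral znodes of the currently live region servers.
    List<String> liveServers = zk.getChildren(RS_ZNODE, false);

    // If the "dead" server's znode is still present, it may just be
    // restarting; do not take over its queues yet.
    if (liveServers.contains(deadServerName)) {
      return false;
    }
    // If nobody is left, this looks like a full cluster shutdown rather
    // than a single failure; skip failover so the queues stay in place.
    return !liveServers.isEmpty();
  }

  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        // no-op watcher, only needed to open the connection in this sketch
      }
    });
    try {
      System.out.println(shouldStartFailover(zk, "example-rs,60020,1363220608780"));
    } finally {
      zk.close();
    }
  }
}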

>
> - Should we stagger the attempts to move the queues for example with a random wait between 0 and 10s, so that not all RSs try at the same time?
> - A test for this scenario? (That's probably tricky)

How about adding a jitter (a random sleep of 0-10 seconds) in the run method
of NodeFailoverWorker before it actually starts the failover processing? I
will try to come up with a test case.
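
A minimal sketch of where that jitter would sit (an illustrative stand-in, not the real NodeFailoverWorker; the class shape and method names here are assumed):

import java.util.Random;

// Sketch: each worker sleeps a random 0-10 seconds before trying to claim
// a dead server's replication queues, so concurrent workers spread out.
public class NodeFailoverWorkerSketch extends Thread {

  private static final int MAX_JITTER_MS = 10000;
  private final Random random = new Random();
  private final String deadRegionServer;

  public NodeFailoverWorkerSketch(String deadRegionServer) {
    this.deadRegionServer = deadRegionServer;
  }

  @Override
  public void run() {
    try {
      // Random wait before attempting the takeover.
      Thread.sleep(random.nextInt(MAX_JITTER_MS));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return;
    }
    claimReplicationQueues(deadRegionServer);
  }

  private void claimReplicationQueues(String deadRegionServer) {
    // Placeholder for the real failover processing (e.g. the queue copy
    // that uses ZooKeeper multi when hbase.zookeeper.useMulti is enabled).
    System.out.println("Claiming replication queues of " + deadRegionServer);
  }
}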

>
>
> -- Lars
>
>
>
> ________________________________
>  From: Andrew Purtell <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Wednesday, March 13, 2013 8:22 PM
> Subject: Re: Replication hosed after simple cluster restart
>
> If Himanshu (?) can fix it quickly we should try to get it in here IMHO.
>
> On Wednesday, March 13, 2013, Ted Yu wrote:
>
>> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
>> HBASE-2611 Handle RS that fails while processing the failure of another one
>> (Himanshu Vashishtha)
>>
>> It went into 0.94.5
>> And the feature is off by default:
>>
>>     <name>hbase.zookeeper.useMulti</name>
>>     <value>false</value>
>>
>> The fact that Lars was the first to report the problem below suggests that
>> no other user has tried this feature.
>>
>> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>>
>> Cheers
>>
>> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>
>> > Hey no problem. It's cool that we found it in a test env. It's probably
>> > quite hard to reproduce.
>> > This is in 0.94.5 but this feature is off by default.
>> >
>> > What's the general thought here, should I kill the current 0.94.6 rc for
>> > this?
>> > My gut says: Yes.
>> >
>> >
>> > I'm also a bit worried about these:
>> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got: java.io.EOFException
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
lars hofhansl 2013-03-14, 04:16
Jean-Marc Spaggiari 2013-03-14, 12:39
Himanshu Vashishtha 2013-03-14, 03:13
Himanshu Vashishtha 2013-03-14, 01:51
Himanshu Vashishtha 2013-03-14, 01:52