HBase, mail # dev - Replication hosed after simple cluster restart


Re: Replication hosed after simple cluster restart
lars hofhansl 2013-03-14, 01:45
Hey no problem. It's cool that we found it in a test env. It's probably quite hard to reproduce.
This is in 0.94.5 but this feature is off by default.

What's the general thought here, should I kill the current 0.94.6 rc for this?
My gut says: Yes.

I'm also a bit worried about these:
2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
        at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10

This happens after bouncing the cluster a second time, and these messages repeat every 10s (for hours now). I think this is a separate problem.

-- Lars

________________________________
From: Himanshu Vashishtha <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
Cc: Ted Yu <[EMAIL PROTECTED]>
Sent: Wednesday, March 13, 2013 6:38 PM
Subject: Re: Replication hosed after simple cluster restart
 
This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
might not be able to move later on, resulting in bogus znodes.
I'll fix this asap. Weird it didn't happen in my testing earlier.
Sorry about this.
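
To make the failure mode described above concrete, here is a minimal sketch, not the HBase code. It assumes the problem has the shape "build the list of queues to claim, attempt an atomic move, return the list regardless of whether the move succeeded". Every name in it (QueueClaimSketch, moveAllAtomically, claimIgnoringResult, claimCheckingResult) is hypothetical, and a plain in-memory map stands in for ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model; none of these names are HBase or ZooKeeper APIs.
class QueueClaimSketch {

    // Stand-in for an atomic multi-operation: moves every queue znode from
    // deadRs to newRs, or changes nothing and returns false.
    static boolean moveAllAtomically(Map<String, String> znodes, String deadRs,
                                     String newRs, boolean simulateFailure) {
        if (simulateFailure) {
            return false; // assumed cause: another server claimed the queues first
        }
        String prefix = deadRs + "/";
        Map<String, String> moved = new TreeMap<>();
        for (Map.Entry<String, String> e : znodes.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                moved.put(newRs + "/" + e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        znodes.keySet().removeIf(k -> k.startsWith(prefix));
        znodes.putAll(moved);
        return true;
    }

    // Lists the queue paths under deadRs as they would appear under newRs.
    static List<String> plannedQueues(Map<String, String> znodes, String deadRs, String newRs) {
        String prefix = deadRs + "/";
        List<String> planned = new ArrayList<>();
        for (String key : znodes.keySet()) {
            if (key.startsWith(prefix)) {
                planned.add(newRs + "/" + key.substring(prefix.length()));
            }
        }
        return planned;
    }

    // Problematic shape: the planned list is returned even when the move failed,
    // so the caller may act on znodes that were never created.
    static List<String> claimIgnoringResult(Map<String, String> znodes, String deadRs,
                                            String newRs, boolean simulateFailure) {
        List<String> planned = plannedQueues(znodes, deadRs, newRs);
        moveAllAtomically(znodes, deadRs, newRs, simulateFailure);
        return planned;
    }

    // Safer shape: report queues only when the move actually happened.
    static List<String> claimCheckingResult(Map<String, String> znodes, String deadRs,
                                            String newRs, boolean simulateFailure) {
        List<String> planned = plannedQueues(znodes, deadRs, newRs);
        return moveAllAtomically(znodes, deadRs, newRs, simulateFailure)
                ? planned
                : new ArrayList<>();
    }

    public static void main(String[] args) {
        Map<String, String> znodes = new TreeMap<>();
        znodes.put("rs-dead/peer1", "wal-0001");
        // The move is forced to fail in both calls; only the second one notices.
        System.out.println(claimIgnoringResult(znodes, "rs-dead", "rs-new", true));
        System.out.println(claimCheckingResult(znodes, "rs-dead", "rs-new", true));
    }
}
```

In this toy model the first variant hands back a path that does not exist in the map, which is one way a caller could end up referring to bogus znodes. The second variant ties the returned list to the outcome of the move.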

On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> Sorry 0.94.6RC1
> (I complain about folks not reporting the version all the time, and then I do it too)
>
>
>
> ________________________________
> From: Ted Yu <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Sent: Wednesday, March 13, 2013 6:17 PM
> Subject: Re: Replication hosed after simple cluster restart
>
>
> Did this happen on 0.94.5?
>
> Thanks
>
>
> On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
>>We just ran into an interesting scenario. We restarted a cluster that was set up as a replication source.
>>The stop went cleanly.
>>
>>Upon restart *all* regionservers aborted within a few seconds with variations of these errors:
>>http://pastebin.com/3iQVuBqS
>>
>>This is scary!
>>
>>-- Lars