HBase, mail # dev - Replication hosed after simple cluster restart


Re: Replication hosed after simple cluster restart
Himanshu Vashishtha 2013-03-14, 01:52
Err.. The jira is https://issues.apache.org/jira/browse/HBASE-7122

On Wed, Mar 13, 2013 at 6:51 PM, Himanshu Vashishtha
<[EMAIL PROTECTED]> wrote:
> The log message you are seeing has been there for a long time, as far
> as I remember (it is a debug-level message).
> I had a patch long back
> https://issues.apache.org/jira/browse/HBASE-7937, which became stale.
>
> Stack: it is not the fault of the multi command; the way the code
> uses it is wrong. There is a race between reading and moving the
> znodes. Basically, if a regionserver fails to move the znodes, it
> should return an empty list, and NOT what it read earlier. This is
> because another regionserver might have moved the znodes in the
> meantime.
>
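[Editor's sketch, not part of the original message.] The fix described above can be illustrated with a minimal, self-contained Java example. It is not HBase code: QueueStore, moveAll and claimQueues are hypothetical names, and the in-memory map only stands in for the ZooKeeper znodes and the atomic multi operation. The point it shows is that a claimer whose move fails reports an empty list instead of the stale list it read earlier.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the replication queue znodes. */
class QueueStore {
    // server name -> names of the replication queues it currently owns
    private final Map<String, List<String>> queues = new HashMap<>();

    synchronized void put(String server, List<String> queueNames) {
        queues.put(server, new ArrayList<>(queueNames));
    }

    /** Plain read, analogous to listing a dead server's znodes. */
    synchronized List<String> read(String server) {
        List<String> found = queues.get(server);
        return found == null ? Collections.<String>emptyList() : new ArrayList<>(found);
    }

    /**
     * All-or-nothing move, standing in for a ZooKeeper multi operation.
     * Returns false if the source queues changed or vanished since the read.
     */
    synchronized boolean moveAll(String from, String to, List<String> expected) {
        if (!expected.equals(queues.get(from))) {
            return false; // someone else moved or modified the queues first
        }
        queues.remove(from);
        queues.computeIfAbsent(to, k -> new ArrayList<>()).addAll(expected);
        return true;
    }
}

class QueueClaimer {
    /** Returns the queues this server now owns, or an empty list if it lost the race. */
    static List<String> claimQueues(QueueStore store, String deadServer, String self) {
        List<String> seen = store.read(deadServer);
        if (seen.isEmpty()) {
            return Collections.emptyList();
        }
        if (!store.moveAll(deadServer, self, seen)) {
            // The move failed, so the earlier read is stale: report nothing.
            return Collections.emptyList();
        }
        return seen;
    }
}
```

With this shape, two servers racing for the same dead server's queues cannot both believe they own them: only the caller whose move succeeded gets a non-empty list back.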
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>> Hey no problem. It's cool that we found it in a test env. It's probably
>> quite hard to reproduce.
>> This is in 0.94.5 but this feature is off by default.
>>
>> What's the general thought here, should I kill the current 0.94.6 rc for
>> this?
>> My gut says: Yes.
>>
>>
>> I'm also a bit worried about these:
>> 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>> java.io.EOFException
>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>> 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>>
>> This happens after bouncing the cluster a 2nd time and these messages repeat
>> every 10s (for hours now). This is a separate problem I think.
>>
>> -- Lars
>>
>> ________________________________
>> From: Himanshu Vashishtha <[EMAIL PROTECTED]>
>>
>> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
>> Cc: Ted Yu <[EMAIL PROTECTED]>
>> Sent: Wednesday, March 13, 2013 6:38 PM
>>
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>> might not be able to move later on, resulting in bogus znodes.
>> I'll fix this asap. Weird it didn't happen in my testing earlier.
>> Sorry about this.
>>
>> On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>> Sorry 0.94.6RC1
>>> (I complain about folks not reporting the version all the time, and then I
>>> do it too)
>>>
>>>
>>>
>>> ________________________________
>>>  From: Ted Yu <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>