HBase dev mailing list: Replication hosed after simple cluster restart


Thread:
lars hofhansl 2013-03-14, 01:12
lars hofhansl 2013-03-14, 01:22
Stack 2013-03-14, 01:43
lars hofhansl 2013-03-14, 01:52
Ted Yu 2013-03-14, 01:17
lars hofhansl 2013-03-14, 01:27
Himanshu Vashishtha 2013-03-14, 01:38
lars hofhansl 2013-03-14, 01:45
Ted Yu 2013-03-14, 03:06
Andrew Purtell 2013-03-14, 03:22
Himanshu Vashishtha 2013-03-14, 03:26
lars hofhansl 2013-03-14, 03:48
Himanshu Vashishtha 2013-03-14, 03:59
lars hofhansl 2013-03-14, 04:16
Jean-Marc Spaggiari 2013-03-14, 12:39
Himanshu Vashishtha 2013-03-14, 03:13
Himanshu Vashishtha 2013-03-14, 01:51
Re: Replication hosed after simple cluster restart
Err.. The jira is https://issues.apache.org/jira/browse/HBASE-7122

On Wed, Mar 13, 2013 at 6:51 PM, Himanshu Vashishtha
<[EMAIL PROTECTED]> wrote:
> The log message you are seeing has been there for a long time, as I
> remember (it is a debug-level message).
> I had a patch long back
> https://issues.apache.org/jira/browse/HBASE-7937, which became stale.
>
> Stack: it is not the fault of the multi command; it is the way the
> code is using it that is wrong. There is a race between reading and
> moving the znodes. Basically, if a regionserver fails to move the
> znodes, it should return an empty list, and NOT what it read
> earlier, because another regionserver might have moved the znodes.
>
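A minimal sketch of the claiming logic described above, assuming a hypothetical claimQueues method and plain ZooKeeper client calls (this is not the actual HBase copyQueuesFromRSUsingMulti code or the HBASE-7122 patch): the dead regionserver's queue znodes are moved in a single atomic multi() call, and if that call fails the method returns an empty list rather than the znodes it read, because another regionserver may have already claimed them.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ReplicationQueueClaimSketch {

  // Hypothetical illustration only. Move every replication-queue znode of a
  // dead regionserver under this regionserver's znode in one atomic multi()
  // transaction; on any failure assume another regionserver won the race and
  // return an empty list instead of the znodes that were read earlier.
  static List<String> claimQueues(ZooKeeper zk, String deadRsZnode, String myRsZnode) {
    List<String> claimed = new ArrayList<>();
    List<Op> ops = new ArrayList<>();
    try {
      for (String queue : zk.getChildren(deadRsZnode, false)) {
        String src = deadRsZnode + "/" + queue;
        byte[] data = zk.getData(src, false, null);
        // Recreate the queue under our own znode and delete the original,
        // both as part of the same transaction.
        ops.add(Op.create(myRsZnode + "/" + queue, data,
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
        ops.add(Op.delete(src, -1));
        claimed.add(queue);
      }
      zk.multi(ops); // atomic: either all queues are claimed or none are
      return claimed;
    } catch (KeeperException e) {
      // Another regionserver moved the znodes first; returning the stale
      // read here is what leaves bogus znodes behind, so return nothing.
      return Collections.emptyList();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Collections.emptyList();
    }
  }
}

Returning the empty list on failure simply makes the caller treat the dead regionserver's queues as claimed by someone else, which matches the behaviour described above.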
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>> Hey no problem. It's cool that we found it in a test env. It's probably
>> quite hard to reproduce.
>> This is in 0.94.5 but this feature is off by default.
>>
>> What's the general thought here, should I kill the current 0.94.6 rc for
>> this?
>> My gut says: Yes.
>>
>>
>> I'm also a bit worried about these:
>> 2013-03-14 01:42:42,271 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication
>> shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> 2013-03-14 01:42:42,358 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>> java.io.EOFException
>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>         at
>> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>         at
>> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>>         at
>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>>         at
>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>>         at
>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>         at
>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>         at
>> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>>         at
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>         at
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>>         at
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>> 2013-03-14 01:42:42,358 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
>> too long for this file, considering dumping
>> 2013-03-14 01:42:42,358 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
>> to open a reader, sleeping 1000 times 10
>>
>> This happens after bouncing the cluster a second time, and these messages
>> repeat every 10s (for hours now). This is a separate problem, I think.
>>
>> -- Lars
>>
>> ________________________________
>> From: Himanshu Vashishtha <[EMAIL PROTECTED]>
>>
>> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
>> Cc: Ted Yu <[EMAIL PROTECTED]>
>> Sent: Wednesday, March 13, 2013 6:38 PM
>>
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>> might not be able to move later on, resulting in bogus znodes.
>> I'll fix this asap. Weird it didn't happen in my testing earlier.
>> Sorry about this.
>>
>> On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>> Sorry 0.94.6RC1
>>> (I complain about folks not reporting the version all the time, and then I
>>> do it too)
>>>
>>>
>>>
>>> ________________________________
>>>  From: Ted Yu <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>