HBase, mail # user - Never ending distributed log split


Jean-Marc Spaggiari 2013-06-02, 15:09
Stack 2013-06-03, 04:35
Re: Never ending distributed log split
Ted Yu 2013-06-02, 15:46
Can you search for 1d44b0630ed7785106a87a2bd4993551/recovered.edits to see
when it was created?
The NameNode log would be a good place to start.
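
One way to do that search is sketched below in Python. It is a plain substring filter and assumes nothing about the NameNode log format; the function name `lines_mentioning` and the log file path in the usage comment are hypothetical, not taken from this thread.

```python
from typing import Iterable, List


def lines_mentioning(lines: Iterable[str], needle: str) -> List[str]:
    """Return every line that contains `needle`, with the trailing newline removed."""
    return [line.rstrip("\n") for line in lines if needle in line]


# Usage sketch (the log path is a placeholder, adjust it for your cluster):
#
#   with open("/path/to/namenode.log") as log:
#       hits = lines_mentioning(
#           log, "1d44b0630ed7785106a87a2bd4993551/recovered.edits")
#   for hit in hits:
#       print(hit)
```

Since log files are written in time order, the first hit is the earliest mention of the directory in that file, which is a reasonable starting point for working out when it appeared.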

bq. we can also rename it so if really required we can replay it later?

The above is a better way of handling the situation.
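
A rough sketch of the rename, in Python: it only builds a `hadoop fs -mv` command line and does not run it, so the command can be reviewed first. The helper name `build_rename_command`, the suffix scheme, and the path in the usage comment are assumptions made for illustration; the full HDFS path of the region directory is not given in this thread.

```python
import time
from typing import List, Optional


def build_rename_command(path: str, suffix: Optional[str] = None) -> List[str]:
    """Build (but do not run) a `hadoop fs -mv` command that renames `path`
    to `path.<suffix>` so the data is kept and can be restored later."""
    if suffix is None:
        # Default to a timestamp so repeated renames do not collide.
        suffix = time.strftime("%Y%m%d%H%M%S")
    src = path.rstrip("/")
    return ["hadoop", "fs", "-mv", src, f"{src}.{suffix}"]


# Usage sketch (placeholder path, not the real region directory):
#
#   cmd = build_rename_command("/path/to/region/recovered.edits", "bak")
#   print(" ".join(cmd))
#   # After reviewing it: subprocess.run(cmd, check=True)
```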

What version of HBase are you using?

Cheers

On Sun, Jun 2, 2013 at 8:09 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> My HBase was in a bad state recently. HBCK did a slow but good job and
> everything is now almost stable. However, I still have one log split
> which is not working. Every minute, the SplitLogManager tries to split
> the log, fails, and retries. It's always the same file. It gets assigned to
> different nodes, but they all fail, and it starts again and again.
>
>
> 2013-06-02 10:44:20,298 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
> logs to split
> 2013-06-02 10:44:20,298 INFO
> org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs
> in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting]
> 2013-06-02 10:44:20,298 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager: wait for status of
> task
> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> to change to DELETED
> 2013-06-02 10:44:20,315 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback:
> deleted
> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> 2013-06-02 10:44:20,329 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task
> at znode
> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> 2013-06-02 10:44:20,341 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task
> at znode
> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
> 2013-06-02 10:44:20,344 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired
>
> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> ver = 0
> 2013-06-02 10:44:20,346 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired
>
> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
> ver = 0
> 2013-06-02 10:44:20,384 INFO
> org.apache.hadoop.hbase.master.SplitLogManager: task
>
> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> acquired by node1,60020,1370136472290
> 2013-06-02 10:44:20,410 INFO
> org.apache.hadoop.hbase.master.SplitLogManager: task
>
> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
> acquired by node4,60020,1370136467255
> 2013-06-02 10:44:20,497 TRACE
> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
> of last_update = 1370184260384 last_version = 1 cur_worker_name =
> node1,60020,1370136472290 status = in_progress incarnation = 0
> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
> server node1,60020,1370136472290 is not marked as dead, we waited for
> 113 while the timeout is 300000
> 2013-06-02 10:44:20,497 TRACE
> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
> of last_update = 1370184260410 last_version = 1 cur_worker_name =
> node4,60020,1370136467255 status = in_progress incarnation = 0
> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
> server node4,60020,1370136467255 is not marked as dead, we waited for
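
Two details in the quoted log are easier to read with a little code. The Python sketch below decodes a splitlog znode name back into the HDFS path of the log file being split, and recomputes the "we waited for 113" figure from `last_update`. It assumes the znode name is the log path URL-encoded once under `/hbase/splitlog/`, and that `last_update` is in epoch milliseconds on the same clock as the master's log timestamps. The function names are hypothetical, and this is not HBase's own code.

```python
from urllib.parse import unquote

SPLITLOG_PREFIX = "/hbase/splitlog/"


def znode_to_log_path(znode: str) -> str:
    """Strip the splitlog prefix and undo one level of URL encoding."""
    name = znode[len(SPLITLOG_PREFIX):] if znode.startswith(SPLITLOG_PREFIX) else znode
    return unquote(name)


def waited_ms(now_ms: int, last_update_ms: int) -> int:
    """Milliseconds elapsed since the task was last updated."""
    return now_ms - last_update_ms


def timeout_expired(waited: int, timeout_ms: int = 300000) -> bool:
    """True once the wait exceeds the timeout quoted in the log message."""
    return waited > timeout_ms


# Values copied from the quoted log. The TRACE line is stamped 10:44:20,497
# and last_update ends in ...384, so "now" is taken to be ...497.
znode = ("/hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2F"
         "node7%2C60020%2C1370118961527-splitting%2F"
         "node7%252C60020%252C1370118961527.1370122562614")
log_path = znode_to_log_path(znode)
elapsed = waited_ms(1370184260497, 1370184260384)  # 113
```

Here `log_path` comes out as `hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614`. The `%2C` that remains is part of the file name itself, which is why it shows up as `%252C` in the znode. With 113 ms elapsed against a 300000 ms timeout, `timeout_expired(elapsed)` is false, which matches the "Skipping the resubmit" message.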
Jean-Marc Spaggiari 2013-06-02, 17:05