HBase, mail # user - Never ending distributed log split


Never ending distributed log split
Jean-Marc Spaggiari 2013-06-02, 15:09
My HBase cluster was in a bad state recently. HBCK did a slow but good
job, and everything is now almost stable. However, one log split is
still not working. Every minute, the SplitLogManager tries to split the
log, fails, and retries. It's always the same file. It gets assigned to
different nodes, but they all fail, and it starts over again and again.
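To see which file a split task refers to, it can help to decode the task znode name that appears in the master log. The following is only a minimal sketch: it assumes the part after /hbase/splitlog/ is the URL-encoded HDFS path of the WAL, which is what the log lines suggest. The helper name wal_path_from_task is made up for illustration and is not part of HBase.

```python
from urllib.parse import unquote

# Task znode name copied from the master log (the part after /hbase/splitlog/).
task = (
    "hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting"
    "%2Fnode7%252C60020%252C1370118961527.1370122562614"
)


def wal_path_from_task(name: str) -> str:
    """Hypothetical helper: undo one level of URL encoding.

    Assumption: the znode name is the WAL path encoded once. The file name
    itself already contains %2C, so it shows up as %252C in the znode name
    and comes back as %2C after a single decode.
    """
    return unquote(name)


wal_path = wal_path_from_task(task)
print(wal_path)
```

If that assumption holds, the printed path is the file the workers keep failing on, and it can then be inspected on HDFS (for example with hadoop fs -ls on the -splitting directory).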
2013-06-02 10:44:20,298 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
logs to split
2013-06-02 10:44:20,298 INFO
org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs
in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting]
2013-06-02 10:44:20,298 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: wait for status of
task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
to change to DELETED
2013-06-02 10:44:20,315 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback:
deleted /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
2013-06-02 10:44:20,329 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task
at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
2013-06-02 10:44:20,341 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task
at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
2013-06-02 10:44:20,344 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired
/hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
ver = 0
2013-06-02 10:44:20,346 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired
/hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
ver = 0
2013-06-02 10:44:20,384 INFO
org.apache.hadoop.hbase.master.SplitLogManager: task
/hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
acquired by node1,60020,1370136472290
2013-06-02 10:44:20,410 INFO
org.apache.hadoop.hbase.master.SplitLogManager: task
/hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
acquired by node4,60020,1370136467255
2013-06-02 10:44:20,497 TRACE
org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
of last_update = 1370184260384 last_version = 1 cur_worker_name node1,60020,1370136472290 status = in_progress incarnation = 0
resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
server node1,60020,1370136472290 is not marked as dead, we waited for
113 while the timeout is 300000
2013-06-02 10:44:20,497 TRACE
org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
of last_update = 1370184260410 last_version = 1 cur_worker_name node4,60020,1370136467255 status = in_progress incarnation = 0
resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
server node4,60020,1370136467255 is not marked as dead, we waited for
87 while the timeout is 300000
2013-06-02 10:44:20,497 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
unassigned = 0
2013-06-02 10:44:21,497 TRACE
org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
of last_update = 1370184261377 last_version = 2 cur_worker_name node1,60020,1370136472290 status = in_progress incarnation = 0
resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
server node1,60020,1370136472290 is not marked as dead, we waited for
120 while the timeout is 300000
2013-06-02 10:44:21,497 TRACE
org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
of last_update = 1370184261410 last_version = 2 cur_worker_name node4,60020,1370136467255 status = in_progress incarnation = 0
resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
server node4,60020,1370136467255 is not marked as dead, we waited for
87 while the timeout is 300000
2013-06-02 10:44:21,497 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
unassigned = 0
2013-06-02 10:44:21,708 DEBUG
org.apache.hadoop.hbase.master.ServerManager: REPORT: Server
node7,60020,1370136467731 came back up, removed it from the dead
servers list
2013-06-02 10:44:22,497 TRACE
org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
of last_update = 1370184261377 last_version = 2 cur_worker_name node1,60020,1370136472290 status = in_progress incarnation = 0
resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
server node1,60020,1370136472290 is not marked as dead, we waited for
1120 while the timeout is 300000
2013-06-02 10:44:22,497 TRACE
org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
of last_update = 1370184261410 last_version = 2 cur_worker_name node4,60020,1370136467255 status = in_progress incarnation = 0
resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
server node4,60020,1370136467255 is not marked as dead, we waited for
1087 while the timeout is 300000
2013-06-02 10:44:22,497 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
unassigned = 0
2013-06-02 10:44:23,497 TRACE
org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
of last_update = 1370184261377 last_version = 2 cur_worker_name node1,60020,1370136472290 status = in_progress incarnation = 0
resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
server node1,60020,1370136472290 is not marked as dead, we waited for
2120 while the timeout