We have been preparing to enable replication between two large clusters.
For the past couple of weeks, replication has been enabled via
hbase-site.xml, but the replication state has been false (set false by
issuing a stop_replication command).
The master is no longer cleaning any logs from /hbase/.oldlogs. It reached
2MM+ logs using 140TB of data before we noticed that the hbase master heap
was growing (about 2GB in use by the LogCleaner from the FileStatus objects
of this directory). Looking at ReplicationLogCleaner, the first check it
makes is whether replication is stopped; if it is, it prevents all logs
from being cleaned, which can lead to the master going OOM or HDFS running
out of space. I would have expected that once replication is stopped, it
would allow logs to be cleaned and expired.
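
To make the difference concrete, here is a rough sketch of the check as I
read it next to the check I would have expected. This is not the actual
ReplicationLogCleaner source: the class name, method names, and parameters
below are all made up for illustration, and the real cleaner reads the
replication state and the peer queues from ZooKeeper rather than taking
them as arguments.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; NOT the real ReplicationLogCleaner source.
// All names here are hypothetical.
class LogCleanerSketch {

    // Behavior as described above: a stopped replication state vetoes
    // every deletion, so .oldlogs grows without bound.
    static boolean isDeletableObserved(boolean replicationRunning,
                                       Set<String> logsQueuedForPeers,
                                       String logName) {
        if (!replicationRunning) {
            return false; // nothing is ever cleaned while stopped
        }
        // While running, only logs still queued for a peer are kept.
        return !logsQueuedForPeers.contains(logName);
    }

    // Behavior I would have expected: when replication is stopped, the
    // replication cleaner does not hold any log back.
    static boolean isDeletableExpected(boolean replicationRunning,
                                       Set<String> logsQueuedForPeers,
                                       String logName) {
        if (!replicationRunning) {
            return true; // leave the decision to the other cleaners (e.g. TTL)
        }
        return !logsQueuedForPeers.contains(logName);
    }

    public static void main(String[] args) {
        Set<String> queued = new HashSet<>(Arrays.asList("log-a"));
        // Replication stopped, log not queued for any peer.
        System.out.println("observed: "
            + isDeletableObserved(false, queued, "log-b"));
        System.out.println("expected: "
            + isDeletableExpected(false, queued, "log-b"));
    }
}
```

With replication stopped, the first method refuses to release even a log
that no peer is waiting on, which matches what we are seeing in .oldlogs.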
Looking through JIRAs, I suspect this is the cause of
I believe our fix will be to start_replication with no peers enabled, but I
think the ReplicationLogCleaner should be changed. Anyone else care to
weigh in with an opinion? (JD?)
There's also some discussion about the "kill switch" that may be relevant