Re: When replication is stopped, .oldlogs is never cleaned
Jean-Daniel Cryans 2013-02-27, 00:23
On Tue, Feb 26, 2013 at 1:03 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
> Thanks for the info, JD.
> My impression was that stop_replication was a fail-safe: you can hit it if
> things are going badly and, though you have no guarantees about data
> replicating afterward, the cluster still functions normally from a single
> cluster's perspective. If instead it's only safe to use it for a
> short time, perhaps some better description of the effects of
> stop_replication would be good to add to the documentation.
> We had actually run stop_replication from a previous experiment and the
> znode was still set to false, so when we put the configuration back in
> place it just started up in "stopped" mode and immediately stopped cleaning
> the .oldlogs.
> The ReplicationLogCleaner has additional logic that, as far as I can tell,
> prevents logs from being cleaned if they are referenced by any of the
> replication state in ZK. It seems to me that if the logs are not
> referenced by anything in ZK that there's no point keeping them around
> while replication is stopped either. What do you think of removing the
> first check that prevents any logs from being cleaned while replication is
> stopped, and relying on the rest of the logic to keep them around?
Well the rest of the logic is part of the replication code, so
logically I think it needs to be disabled too if you kill replication.
It leaves us with the choice of keeping the logs around or not. If you
think the former is dangerous then we should do the latter.
> On Tue, Feb 26, 2013 at 12:53 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
>> The stop_replication command is really a way to kill it, not a way to
>> stop it. My bad for naming it like that. It should only be used if
>> you're having problems and need to stop all replication activities
>> from happening. It is dirty by nature.
>> It won't clean the logs since you may want to restart replication
>> after killing it. One could make the point that since killing
>> replication is dirty you don't need to keep the logs around, which would
>> be fair. But to me you should never have to stay on stop_replication
>> more than a few minutes: either you continue replicating, you drop
>> the peer, or you disable that peer.
>> FWIW setting hbase.replication to true with no peers should achieve
>> what you want, no need to call stop_replication.
>> On Tue, Feb 26, 2013 at 3:25 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
>> > We have been preparing to enable replication between two large clusters.
>> > For the past couple of weeks, replication has been enabled via
>> > hbase-site.xml, but the replication state has been false (set false by
>> > issuing a stop_replication command).
>> > The master is no longer cleaning any logs from /hbase/.oldlogs. It grew to
>> > 2MM+ logs using 140TB of data before we noticed that the hbase master's
>> > memory use was growing (about 2GB in use by the LogCleaner from the FileStatus
>> > of this directory). Looking at ReplicationLogCleaner, the first check it
>> > makes is that if replication is stopped, then it prevents all logs from
>> > being cleaned, which can lead to the master going OOM or HDFS running out
>> > of space. I would have expected once replication is stopped that it would
>> > allow logs to be cleaned and expired.
>> > Looking through JIRAs, I suspect this is the cause of
>> > https://issues.apache.org/jira/browse/HBASE-3489
>> > I believe our fix will be to start_replication with no peers enabled, but I
>> > think the ReplicationLogCleaner should be changed. Anyone else care to
>> > weigh in with an opinion? (JD?)
>> > There's also some discussion about the "kill switch" that may be relevant
>> > here:
>> > https://issues.apache.org/jira/browse/HBASE-5222
>> > Dave
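
The three behaviors discussed in this thread for a stopped replication state can be compared with a small model. This is only a rough sketch and not the actual ReplicationLogCleaner code: the function name, the policy names, and the representation of logs as plain strings are all assumptions made for illustration. "keep_all" models the behavior Dave describes (nothing is cleaned while stopped), "keep_referenced" models his proposal (rely only on the ZK reference check), and "keep_none" models the alternative JD raises (treat the whole replication check as disabled).

```python
def deletable_logs(old_logs, referenced_in_zk, replication_stopped, policy="keep_all"):
    """Return the old logs that this simplified cleaner would allow to be deleted.

    All names here are hypothetical; this is a model of the discussion,
    not HBase's implementation.
    """
    if not replication_stopped or policy == "keep_referenced":
        # Normal path (and Dave's proposal while stopped): only logs still
        # referenced by replication state in ZK are protected.
        return [log for log in old_logs if log not in referenced_in_zk]
    if policy == "keep_all":
        # Behavior described in the thread: while replication is stopped,
        # no log is ever cleanable, so .oldlogs grows without bound.
        return []
    if policy == "keep_none":
        # JD's alternative: killing replication disables the whole check,
        # so replication no longer protects any log.
        return list(old_logs)
    raise ValueError(f"unknown policy: {policy}")


logs = ["wal.1", "wal.2", "wal.3"]
refs = {"wal.2"}
print(deletable_logs(logs, refs, replication_stopped=True))
print(deletable_logs(logs, refs, replication_stopped=True, policy="keep_referenced"))
print(deletable_logs(logs, refs, replication_stopped=True, policy="keep_none"))
```

In this model only "keep_all" lets the directory grow indefinitely while replication stays stopped, which matches the symptom reported above.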