Re: Long running replication: possible improvements
Jean-Daniel Cryans 2012-07-26, 23:58
On Wed, Jul 25, 2012 at 5:58 PM, Himanshu Vashishtha
<[EMAIL PROTECTED]> wrote:
> Replication works well when run for a short span, but its performance
> in a long-running setup seems to degrade on the slave cluster side. To
> an extent, it made that cluster unresponsive in one of our testing
> environments. As
> per jstack on one node, all its priority handlers were blocked in the
> replicateLogEntries method, which is blocked as the cluster is in bad
> shape (2/4 nodes died; root is unassigned; and the node which had it
> previously became un-responsive; and the only other remaining node
> doesn't have any priority handler left to take care of the root region
Currently the best way to fix this would be to have a separate set of
> The memory footprint of the app also increases (based on
> `top`; unfortunately, no gc logs at the moment).
You don't want to rely on top for that since it's a Java application.
Set your Xms as big as your Xmx and your application will always use
all the memory it's given.
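For a view from inside the JVM instead of top, something like the
following could be used. It is only a sketch on top of the standard
java.lang.management API; the HeapCheck class name is made up for the
example and it is not an HBase utility.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class HeapCheck {
    /** Snapshot of the heap as the JVM itself accounts for it. */
    static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    public static void main(String[] args) {
        MemoryUsage u = heap();
        long mb = 1024L * 1024L;
        // used      = memory currently occupied by objects (live or not yet collected)
        // committed = memory guaranteed to be available to the JVM for the heap
        // max       = upper bound for the heap, or -1 if undefined
        System.out.printf("used=%dMB committed=%dMB max=%dMB%n",
                u.getUsed() / mb, u.getCommitted() / mb, u.getMax() / mb);
    }
}
```

Watching "used" over time (or turning on GC logs) says more about a
growing footprint than the process size reported by the OS.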
> The replicateLogEntries is a high QOS method; ReplicationSink's
> overall behavior is to act as a native hbase client and replicate the
> mutations in its cluster. This may take some time, e.g. when a region
> is splitting or there is a GC pause at the target region servers. It
> enters the retry loop, and this blocks the priority handler
> serving that method.
> Meanwhile, other master cluster region servers are also shipping edits
> (to this, or other regionservers). This makes the situation more
> I wonder whether others have seen this before. Please share.
See my first answer.
> There is some scope for improvement on the Sink side:
> a) ReplicationSink#replicateLogEntries: Make it a normal operation (no
> high QOS annotation), and have ReplicationSink periodically check
> whether the client is still connected. If it's not, just throw an
> exception and bail out. The client will resend the shipment
> anyway. This frees the handlers from blocking, and the cluster's
> normal operation will not be impeded.
It wasn't working any better before HBASE-4280 :)
> b) Have a threadpool in ReplicationSink and process per-table requests
> in parallel. Should help in case of multi-table replication.
Currently it's trying to apply the edits sequentially; going parallel
would apply them in the wrong order. Note that when a region server
fails we do continue to replicate the new edits while we also replicate
the backlog from the old server, so currently it's not 100% perfect.
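To make the trade-off in (b) concrete, here is a rough sketch of what a
per-table pool could look like if the shipment is grouped by table
first. Order is then kept within a table but not across tables. The Edit
class, PerTableSink and the apply callback are all made up for the
example; none of this is the actual ReplicationSink code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class PerTableSink {
    /** Hypothetical stand-in for one replicated edit; not an HBase class. */
    static final class Edit {
        final String table;
        final long seq;
        Edit(String table, long seq) { this.table = table; this.seq = seq; }
    }

    /** Group edits by table, keeping the shipped order within each table. */
    static Map<String, List<Edit>> groupByTable(List<Edit> shipped) {
        Map<String, List<Edit>> byTable = new LinkedHashMap<>();
        for (Edit e : shipped) {
            byTable.computeIfAbsent(e.table, t -> new ArrayList<>()).add(e);
        }
        return byTable;
    }

    /** One task per table; each task applies its own edits sequentially. */
    static void applyInParallel(List<Edit> shipped, int threads, Consumer<Edit> apply) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<Edit> edits : groupByTable(shipped).values()) {
                pending.add(pool.submit(() -> edits.forEach(apply)));
            }
            for (Future<?> f : pending) {
                f.get(); // surface failures so the caller can fail the whole shipment
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while applying edits", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("applying edits failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether losing the ordering across tables is acceptable is a separate
question, and this does nothing for the failover case mentioned above.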
> c) Free the memory consumed by the shipped array as soon as the
> mutation list is populated. Currently, if the call to multi is blocked
> (for any reason), the regionserver enters the retry logic... and
> since the entries of the WALEdits array are copied into Put/Delete
> objects, the array can be freed.
So free up the entries array at each position after the Put or Delete
was created? We could do that, although it's not a big saving
considering that entries will be at most 64MB big. In production here
we run with just 1 MB.
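Roughly, freeing each position as it is converted could look like the
sketch below. Entry and Mutation are hypothetical stand-ins for the
shipped WAL entries and the Put/Delete objects, and the sketch assumes
the mutation keeps its own copy of the data, which may not match what
the real code does.

```java
import java.util.ArrayList;
import java.util.List;

class ReleaseAsConverted {
    /** Hypothetical stand-in for one shipped WAL entry. */
    static final class Entry {
        final byte[] payload;
        Entry(byte[] payload) { this.payload = payload; }
    }

    /** Hypothetical stand-in for the Put or Delete built from an entry. */
    static final class Mutation {
        final byte[] data;
        Mutation(byte[] data) { this.data = data; }
    }

    /** Build the mutation list, clearing each array slot once it is converted. */
    static List<Mutation> toMutations(Entry[] entries) {
        List<Mutation> mutations = new ArrayList<>(entries.length);
        for (int i = 0; i < entries.length; i++) {
            // Assumption: the mutation holds its own copy of the data.
            mutations.add(new Mutation(entries[i].payload.clone()));
            // Drop the reference so the entry can be collected while the
            // batch is still being retried.
            entries[i] = null;
        }
        return mutations;
    }
}
```

This only helps if nothing else is still holding a reference to the
entries while the sink is retrying.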