-
replication - how do I know the status?
Neil Yalowitz 2012-09-13, 20:18
Hi all,
I'm using HBase replication between two clusters running CDH3u3 and I recently noticed that a replicated column family was "lagging" by more than a day... that is, it required more than 24 hours for a Put to replicate from master to slave. The root cause of the lag appears to be swapping and other bad behavior.
The real question I have is this: how do I know the state of replication at any given time? Does a large amount of data in /hbase/.logs indicate that replication is falling behind? What about /hbase/.oldlogs which seems to grow forever? What red flags should I look for to tell me that there is a problem with replication?
Neil Yalowitz [EMAIL PROTECTED]
-
Re: replication - how do I know the status?
Jean-Daniel Cryans 2012-09-13, 21:18
The best metric at the moment is hbase.replication.sizeOfLogQueue, published through JMX. If you have Ganglia, OpenTSDB or Cacti, you can graph how many logs per server still need to be replicated, and then you'll have a good idea of how much data is waiting to be replicated.
If it goes up to more than 2 per server for a few minutes, you know you are either slowing down or someone is inserting a lot of data.
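As a rough illustration of that rule of thumb, here is a minimal Python sketch that flags servers whose queue size stays above 2 for several samples in a row. It assumes you have already collected sizeOfLogQueue readings per region server (for example, exported from your graphing tool) at one sample per minute; the function name, the sample data and the five-sample window are all hypothetical, not part of HBase.

```python
from typing import Dict, List


def lagging_servers(
    samples: Dict[str, List[int]],
    threshold: int = 2,
    min_consecutive: int = 5,
) -> List[str]:
    """Return servers whose queue size stayed above `threshold`
    for at least `min_consecutive` consecutive samples.

    `samples` maps a server name to its sizeOfLogQueue readings in
    time order. The defaults (more than 2 logs, 5 samples in a row)
    are an assumed reading of "more than 2 per server for a few minutes".
    """
    flagged = []
    for server, sizes in samples.items():
        run = 0  # length of the current streak above the threshold
        for size in sizes:
            run = run + 1 if size > threshold else 0
            if run >= min_consecutive:
                flagged.append(server)
                break
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical readings, one per minute per region server.
    readings = {
        "rs1": [0, 1, 1, 0, 1, 2, 1],     # healthy
        "rs2": [1, 3, 4, 6, 9, 12, 15],   # queue keeps growing
        "rs3": [0, 5, 1, 4, 0, 3, 1],     # short spikes only
    }
    print(lagging_servers(readings))
```

Requiring a sustained streak rather than a single high reading avoids alerting on brief spikes, such as a short burst of inserts that the queue drains on its own.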
J-D
-
Re: replication - how do I know the status?
Neil Yalowitz 2012-09-13, 21:28
This is a great answer, I can see that particular ganglia metric sharply increased when the issue began. Thanks much.
One followup question:
Can a distressed slave cluster cause performance issues on the master cluster? It appears our performance problem was occurring on the slave peer, but the master cluster almost crashed as well. I'm trying to determine if that was a coincidence or something more...
Neil
-
Re: replication - how do I know the status?
Jean-Daniel Cryans 2012-09-13, 21:48
On Thu, Sep 13, 2012 at 2:28 PM, Neil Yalowitz <[EMAIL PROTECTED]> wrote:
> This is a great answer, I can see that particular ganglia metric sharply
> increased when the issue began. Thanks much.
Nice!
> One followup question:
>
> Can a distressed slave cluster cause performance issues on the master
> cluster? It appears our performance problem was occurring on the slave
> peer, but the master cluster almost crashed as well. I'm trying to
> determine if that was a coincidence or something more...
That's a tougher one, but FWIW the work required on the master cluster is low compared to what the slave has to do; the master just needs to read a bunch of edits and send them whereas the slave has to write them to the WAL, add them to the MemStore, eventually flush and compact, etc.
Also if you had a big MR job that ran on the master and that inserted a lot of data, I would assume that it made everything slower. If it's also what caused swapping then it would explain a lot.
J-D