-
Tracking down coprocessor pauses
Tom Brown 2012-09-10, 17:32
Hi,
We have our system set up such that all interaction is done through co-processors. We update the database via a co-processor (it has the appropriate logic for dealing with concurrent access to rows), and we also query/aggregate via co-processor (since we don't want to send all the data over the network).

This generally works very well. However, sometimes one of the region servers will "pause". This doesn't appear to be a GC pause, since the server still serves up the UI and adds occasional messages to the log regarding the LRU. The only thing I've found is that when I check the server that's causing the problem (easy to tell, since all the "working" servers have a low load and the problem server has a higher load), I can see a number of execCoprocessor requests that have been executing for much longer than they should.

I want to know more details about the specifics of those requests. Is there an API I can use that will allow my coprocessor requests to be tracked more functionally? Is there a way to hook into the UI so I can provide my own list of running processes? Or would I have to write that all myself?

I am using HBase 0.92.1, but will be upgrading to 0.94.1 soon.

Thanks in advance!

--Tom
-
Re: Tracking down coprocessor pauses
Andrew Purtell 2012-09-10, 18:31
On Mon, Sep 10, 2012 at 10:32 AM, Tom Brown <[EMAIL PROTECTED]> wrote:
> I want to know more details about the specifics of those requests; Is
> there an API I can use that will allow my coprocessor requests to be
> tracked more functionally? Is there a way to hook into the UI so I can
> provide my own list of running processes? Or would I have to write
> that all myself?
>
> I am using HBase 0.92.1, but will be upgrading to 0.94.1 soon.

I haven't actually done this, so YMMV, but you should be able to get a reference to the TaskMonitor singleton (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/monitoring/TaskMonitor.html) via the static method TaskMonitor.get() and then create and update the state of MonitoredTasks (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/monitoring/MonitoredTask.html) for your coprocessor's internal functions.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
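For comparison, the "write that all myself" option from the question could start as small as the sketch below. It is only an illustration: InFlightRequests, begin, end and olderThan are hypothetical names, nothing here is an HBase API, and it does not hook into the region server UI. It just keeps a list of in-flight requests that a coprocessor could log or expose however it likes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical registry of in-flight requests; not part of HBase. */
final class InFlightRequests {

    /** One tracked request: a free-form description and its start time. */
    static final class Entry {
        final String description;
        final long startNanos;

        Entry(String description, long startNanos) {
            this.description = description;
            this.startNanos = startNanos;
        }
    }

    private final Map<Long, Entry> active = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    /** Registers a request and returns a handle to pass to end(). */
    long begin(String description) {
        long id = nextId.incrementAndGet();
        active.put(id, new Entry(description, System.nanoTime()));
        return id;
    }

    /** Removes a request once it has finished (call from a finally block). */
    void end(long id) {
        active.remove(id);
    }

    /** Describes every request that has been running at least thresholdMillis. */
    List<String> olderThan(long thresholdMillis) {
        long now = System.nanoTime();
        List<String> out = new ArrayList<>();
        for (Entry e : active.values()) {
            long ageMillis = (now - e.startNanos) / 1_000_000L;
            if (ageMillis >= thresholdMillis) {
                out.add(e.description + " (" + ageMillis + " ms)");
            }
        }
        return out;
    }
}
```

A coprocessor method would call begin() on entry with whatever detail is useful (row range, requested time period), call end() in a finally block, and periodically log olderThan(...) to see which requests are stuck.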
-
Re: Tracking down coprocessor pauses
Michael Segel 2012-09-10, 18:45
On Sep 10, 2012, at 12:32 PM, Tom Brown <[EMAIL PROTECTED]> wrote:

> We have our system setup such that all interaction is done through
> co-processors. We update the database via a co-processor (it has the
> appropriate logic for dealing with concurrent access to rows), and we
> also query/aggregate via co-processor (since we don't want to send all
> the data over the network).

Could you expand on this? On the surface, this doesn't sound like a very good idea.
-
Re: Tracking down coprocessor pauses
Tom Brown 2012-09-10, 20:39
Michael,
We are using HBase to track the usage of our service. Specifically, each client sends an update when they start a task, at regular intervals during the task, and when they finish a task (and then presumably they start another, continuing the cycle). Each user has various attributes (which version of our software they're using, their location, which task they're working on, etc.), and we want to be able to see stats in aggregate and drill down into various areas (similar to OLAP; incidentally, we chose HBase because none of the OLAP systems seemed to accept real-time updates).

The key design is a compound of: [Attribute1 Attribute2... AttributeN]. Each row has roughly 10 cells, all of which represent counters; some require simple incrementing, but others require fancier bitwise operations to properly increment (using HyperLogLog to estimate a unique count).

The rows are stored with a 15-second granularity (everything from 0:00-0:15 is stored in one row, everything from 0:15-0:30 is in the next, etc.). The data is formatted such that you can get the aggregation for a larger time period by combining all of the rows that comprise that time frame. For the counter cells, this uses straight addition. For the unique counters, bitwise operations are required.

The most frequently requested data has only one or two relevant attributes. For example, we commonly want to see the stats of our system broken out just by task. Of course, that makes writes a little more difficult. When we have thousands of users working on the same kind of task, we'll receive a lot of concurrent updates to the row with [attribute=TheTask].

HBase supports atomic increments, but not atomic bitwise operations, so we were required to implement a locking solution on our own. There seemed to be a lot of problems with row-level locks, so we decided to do the locking in the one place we could guarantee it: a coprocessor.
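The row layout described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual coprocessor code: BucketAggregation and Row are hypothetical in-memory stand-ins rather than HBase types, and the unique-count cell is modelled as a plain bitmap merged with bitwise OR, because the exact bitwise operation is not given in the message. (A textbook HyperLogLog merge takes the per-register maximum instead.)

```java
import java.util.List;

/** Hypothetical stand-in for the 15-second rows; not an HBase type. */
final class BucketAggregation {

    static final long BUCKET_MILLIS = 15_000L;

    /** Start of the 15-second bucket that contains the given timestamp. */
    static long bucketStart(long epochMillis) {
        return epochMillis - Math.floorMod(epochMillis, BUCKET_MILLIS);
    }

    /** One row: plain counters plus a bitmap used for the unique count. */
    static final class Row {
        final long[] counters; // merged by straight addition
        final long[] sketch;   // merged by bitwise OR (assumption, see lead-in)

        Row(long[] counters, long[] sketch) {
            this.counters = counters;
            this.sketch = sketch;
        }
    }

    /** Combines the rows of a larger time period into one summary row. */
    static Row combine(List<Row> rows, int counterCount, int sketchWords) {
        long[] counters = new long[counterCount];
        long[] sketch = new long[sketchWords];
        for (Row row : rows) {
            for (int i = 0; i < counterCount; i++) {
                counters[i] += row.counters[i];
            }
            for (int i = 0; i < sketchWords; i++) {
                sketch[i] |= row.sketch[i];
            }
        }
        return new Row(counters, sketch);
    }
}
```

Both merge operations are associative and commutative, which is what lets any set of 15-second rows be combined in any order into a summary for the larger period.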
The coprocessor contains logic to coalesce multiple updates to the same row into a single HBase update.

When performing aggregations, a requested time period might summarize thousands of rows into a single summary row. We thought that sending the entire set over the network was overkill, especially since the aggregation operations are fairly simple (addition and some bitwise calculations), so the co-processor also contains code to perform aggregations.

I'm interested in improving the design, so any suggestions will be appreciated.

Thanks in advance,

--Tom
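The coalescing step mentioned above might look something like this sketch. It is an assumption-laden illustration, not the code from the thread: UpdateCoalescer, add and drain are hypothetical names, only additive counters are shown, and the actual write to HBase is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical buffer that folds many updates to one row into one delta. */
final class UpdateCoalescer {

    private final Map<String, long[]> pending = new HashMap<>();
    private final int counterCount;

    UpdateCoalescer(int counterCount) {
        this.counterCount = counterCount;
    }

    /** Adds one client update to the pending delta for its row. */
    synchronized void add(String rowKey, long[] deltas) {
        long[] acc = pending.computeIfAbsent(rowKey, k -> new long[counterCount]);
        for (int i = 0; i < counterCount; i++) {
            acc[i] += deltas[i];
        }
    }

    /** Returns everything accumulated so far and starts a fresh batch. */
    synchronized Map<String, long[]> drain() {
        Map<String, long[]> batch = new HashMap<>(pending);
        pending.clear();
        return batch;
    }
}
```

With this shape, a thousand concurrent updates to the row for one task become a single delta per drain() call, so the caller issues one write per row per batch instead of one per client update.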
-
Re: Tracking down coprocessor pauses
Tom Brown 2012-09-12, 17:40
I have captured some logs from what is happening during one of these pauses.

http://pastebin.com/K162Einz

Can someone help me figure out what's actually going on from these logs?

--- My interpretation of the logs ---

As you can see at the start of the logs, my coprocessor for updating the data is executing rapidly until 10:17:06.

At that time the coprocessor for querying is invoked. This query should take only moments to return, but doesn't return until 10:44:52.

At 10:18:53 there appear to be some compaction-related messages (though they didn't appear to be the cause, happening over a minute after the server stops functioning).

It appears to run compaction until 10:42:25. The next two minutes contain just LRU eviction messages.

At 10:44:52, the query from earlier appears to complete, after having summarized only 863 rows. A few other queued requests are attempted, but fail with exceptions (ClosedChannelException).

Eventually the exceptions are being thrown from "openScanner", which really doesn't sound good to me.

--Tom
-
Re: Tracking down coprocessor pauses
Andrew Purtell 2012-09-12, 18:06
Inline
On Wed, Sep 12, 2012 at 10:40 AM, Tom Brown <[EMAIL PROTECTED]> wrote:

> I have captured some logs from what is happening during one of these pauses.
>
> http://pastebin.com/K162Einz
>
> Can someone help me figure out what's actually going on from these logs?
>
> --- My interpretation of the logs ---
>
> As you can see at the start of the logs, my coprocessor for updating
> the data is executing rapidly until 10:17:06.
>
> At that time the coprocessor for querying is invoked. This query
> should take only moments to return, but doesn't return until 10:44:52.

Here it would be helpful to get a stack trace from the regionserver where the CP is executing, to see where the RPC threads servicing the CP invocations are hung up.

> At 10:18:53 there appear to be some compaction related messages
> (though they didn't appear to be the cause, happening over a minute
> after the server stops functioning).
>
> It appears to run compaction until 10:42:25. The next two minutes
> contain just LRU eviction messages.
>
> At 10:44:52, the query from earlier appears to complete, after having
> summarized only 863 rows. A few other queued requests are attempted,
> but fail with exceptions (ClosedChannelException).
>
> Eventually the exceptions are being thrown from "openScanner", which
> really doesn't sound good to me.

The ClosedChannelExceptions appear to come from RPC service threads that, now unstuck, are processing queued-up CP invocations. The client has already given up, so the threads can't write back results and error out.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
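A stack trace like the one requested above is normally taken from outside the process, for example by running the JDK's jstack tool against the regionserver's pid. As a rough in-process equivalent, the sketch below uses the standard java.lang.management API to dump the threads of the JVM it runs in. ThreadDumpSketch and the nameFilter parameter are hypothetical, and the thread names the regionserver gives its RPC handlers are not stated in this thread, so the filter string is left to the caller.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper that formats stack traces of the current JVM's threads. */
final class ThreadDumpSketch {

    /** Dumps every live thread whose name contains nameFilter. */
    static String dump(String nameFilter) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details, which keeps the dump cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (!info.getThreadName().contains(nameFilter)) {
                continue;
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }
}
```

Taking two or three dumps a few seconds apart during a pause shows whether the stuck threads are blocked at the same frame each time or are slowly making progress.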