HBase, mail # user - HBase Table Row Count Optimization - A Solicitation For Help

Re: HBase Table Row Count Optimization - A Solicitation For Help
Himanshu Vashishtha 2013-09-26, 21:20
Sorry for chiming in late here.

The Aggregation coprocessor works well for smaller datasets, or when you
are computing over a range of a table rather than the whole table.

During its development, I ran row counts over 1m and 10m rows
(spanning about 25 regions in the test table). In its current form,
I would avoid using it for tables bigger than that.

If you are scanning a huge data set (which you are doing), there is a
chance your request will fail: for example, you get a SocketTimeoutException
if your request takes longer to process than the client rpc timeout
(60 sec by default). Or it may block other requests if all the
regionserver handlers get busy processing your request.
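
To put rough numbers on the timeout concern above, here is a minimal back-of-envelope sketch. It assumes rows are spread evenly across regions and that each region has to answer its part of the count within a single client RPC; both are simplifying assumptions, and the class and method names are purely illustrative (they are not part of HBase).

```java
/** Back-of-envelope check of the per-region time budget (illustrative only). */
final class RegionTimeBudget {

    /** Default client rpc timeout mentioned above, in seconds. */
    static final long RPC_TIMEOUT_SECONDS = 60;

    /** Rows per region, ASSUMING rows are spread evenly across regions. */
    public static long rowsPerRegion(long totalRows, int regions) {
        return totalRows / regions;
    }

    /** Minimum rows/sec a region must sustain to finish before the timeout. */
    public static long minRowsPerSecond(long rowsInRegion, long timeoutSeconds) {
        // Round up so the returned rate really fits inside the timeout.
        return (rowsInRegion + timeoutSeconds - 1) / timeoutSeconds;
    }

    public static void main(String[] args) {
        // 10m rows over about 25 regions, as in the test table described above.
        long perRegion = rowsPerRegion(10_000_000L, 25);
        long minRate = minRowsPerSecond(perRegion, RPC_TIMEOUT_SECONDS);
        System.out.println("rows per region: " + perRegion);
        System.out.println("min rows/sec per region: " + minRate);
    }
}
```

For the 10m-row test table this gives 400,000 rows per region, so each region needs to sustain roughly 6,667 rows/sec to stay under 60 sec. Regions holding far more rows than that raise the required rate proportionally, which is where the timeout risk comes from.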

I would use the rowcounter mapreduce job (co-locating it with the hbase
cluster) to get the result within a decent processing time.

FWIW, there is also a blog post covering the findings I mentioned above:
http://hbase-coprocessor-experiments.blogspot.com/
If you find it long, just skim to the RowCount section :)

Thanks,
Himanshu
On Tue, Sep 24, 2013 at 9:33 AM, James Birchfield <
[EMAIL PROTECTED]> wrote:

> Just wanted to follow up here with a little update.  We enabled the
> Aggregation coprocessor on our dev cluster.  Here are the quick timing
> stats.
>
> Tables: 565
> Total Rows: 2,749,015,957
> Total Time (to count): 52m:33s
>
> Will be interesting to see how this fares against our production clusters
> with a lot more data.
>
> Thanks again for all of your help!
> Birch
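
For reference, the timing stats quoted above can be turned into an average rate with simple arithmetic. This is only a sketch: it gives an aggregate figure across all 565 tables and says nothing about per-table or per-region rates, and the class and method names are illustrative.

```java
/** Converts the quoted timing stats into an average count rate (illustrative only). */
final class RowCountThroughput {

    /** Converts a minutes:seconds duration into seconds. */
    public static long toSeconds(long minutes, long seconds) {
        return minutes * 60 + seconds;
    }

    /** Average rows counted per second, rounded down. */
    public static long rowsPerSecond(long totalRows, long elapsedSeconds) {
        return totalRows / elapsedSeconds;
    }

    public static void main(String[] args) {
        long elapsed = toSeconds(52, 33);                    // 52m:33s = 3153 s
        long rate = rowsPerSecond(2_749_015_957L, elapsed);  // 871873 rows/sec
        System.out.println("elapsed seconds: " + elapsed);
        System.out.println("average rows/sec: " + rate);
    }
}
```

That is roughly 870,000 rows per second averaged over the whole run.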
> On Sep 20, 2013, at 10:06 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > Hey we all start somewhere. I did the "LocalJobRunner" thing many times
> > and wondered why it was so slow, until I realized I hadn't set up my client
> > correctly.
> > The LocalJobRunner runs the M/R job on the client machine. This is
> really just for testing and terribly slow.
> >
> > From later emails in this I gather you managed to run this as an actual
> M/R on the cluster? (by the way you do not need to start the job on a
> machine on the cluster, but just configure your client correctly to ship
> the job to the M/R cluster)
> >
> >
> > Was that still too slow? I would love to get my hands on some numbers. If
> > you have trillions of rows and can run this job with a few mappers per
> > machine, those would be good numbers to publish here.
> > In any case, let us know how it goes.
> >
> >
> > -- Lars
> >
> >
> > btw. my calculations were assuming that network IO is the bottleneck. For
> > larger jobs (such as yours) it's typically either that or disk IO.
> > ________________________________
> >
> > From: James Birchfield <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> > Sent: Friday, September 20, 2013 6:21 PM
> > Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
> >
> >
> > Thanks Lars.  I like your time calculations much better than mine.
> >
> > So this is where my inexperience is probably going to come glaring
> > through.  And maybe the root of all this.  I am not running the MapReduce
> > job on a node in the cluster.  It is running on a development server that
> > connects remotely to the cluster.  Furthermore, I am not executing the
> > MapReduce job from the command line using the CLI as seen in many of the
> > examples.  I am executing it in-process from a stand-alone Java program I
> > have written.  It is simple in nature: it creates an HBaseAdmin
> > connection, lists the tables and looks up the column families, closes the
> > admin connection, then loops over the table list and runs the following
> > code:
> >
> > public class RowCounterRunner {
> >
> >     public static long countRows(String tableName) throws Exception {
> >
> >         Job job = RowCounter.createSubmittableJob(
> >                 ConfigManager.getConfiguration(), new String[]{tableName});
> >         boolean waitForCompletion = job.waitForCompletion(true);
> >         Counters counters = job.getCounters();
> >         Counter findCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);