HBase user mailing list - HBase Table Row Count Optimization - A Solicitation For Help


Re: HBase Table Row Count Optimization - A Solicitation For Help
Ted Yu 2013-09-20, 22:19
How many nodes do you have in your cluster?

When counting rows, what other load would be placed on the cluster?

What is the HBase version you're currently using / planning to use?

Thanks
On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
[EMAIL PROTECTED]> wrote:

>         After reading the documentation and scouring the mailing list
> archives, I understand there is no real support for fast row counting in
> HBase unless you build some sort of tracking logic into your code.  In our
> case, we do not have such logic, and we have massive amounts of data already
> persisted.  I am running into very long execution times for the RowCounter
> MapReduce job against very large tables (multi-billion rows for many of
> them, by our estimate).  I understand why this is the case and am slowly
> accepting it, but I am hoping I can solicit some ideas to help speed things
> up a little.
>
>         My current task is to provide total row counts on about 600
> tables, some extremely large, some not so much.  Currently, I have a
> process that runs the RowCounter MapReduce job in-process like so:
>
>                         Job job = RowCounter.createSubmittableJob(
>                                         ConfigManager.getConfiguration(),
>                                         new String[] { tableName });
>                         boolean waitForCompletion = job.waitForCompletion(true);
>                         Counters counters = job.getCounters();
>                         Counter rowCounter = counters.findCounter(
>                                         hbaseadminconnection.Counters.ROWS);
>                         return rowCounter.getValue();
>
>         At the moment, the MapReduce jobs execute serially, counting one
> table at a time.  As the process stands right now, my rough timing
> calculations indicate that fully counting all the rows of these 600 tables
> will take anywhere between 11 and 22 days (roughly 26 to 53 minutes per
> table on average).  This is not what I consider a desirable timeframe.
>
>         I have considered three alternative approaches to speed things up.
>
>         First, since the application is not heavily CPU bound, I could use
> a thread pool and execute multiple MapReduce jobs at the same time, each
> looking at a different table.  I have never done this, so I am unsure if it
> would cause any unanticipated side effects.  A rough sketch of what I mean
> follows.
>
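> Something along these lines is what I am picturing; the pool size and the
> counter group/name lookup are assumptions on my part, not tested code:
>
>         import java.util.HashMap;
>         import java.util.Map;
>         import java.util.concurrent.Callable;
>         import java.util.concurrent.ExecutorService;
>         import java.util.concurrent.Executors;
>         import java.util.concurrent.Future;
>         import org.apache.hadoop.conf.Configuration;
>         import org.apache.hadoop.hbase.HBaseConfiguration;
>         import org.apache.hadoop.hbase.mapreduce.RowCounter;
>         import org.apache.hadoop.mapreduce.Job;
>
>         public class ParallelRowCount {
>
>             // Wraps the RowCounter snippet above for a single table.
>             static long countRows(String tableName) throws Exception {
>                 Configuration conf = HBaseConfiguration.create();
>                 Job job = RowCounter.createSubmittableJob(
>                                 conf, new String[] { tableName });
>                 job.waitForCompletion(true);
>                 // Counter group/name per the 0.94-era RowCounter; adjust per version.
>                 return job.getCounters().findCounter(
>                                 "org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
>                                 "ROWS").getValue();
>             }
>
>             public static void main(String[] args) throws Exception {
>                 ExecutorService pool = Executors.newFixedThreadPool(4); // e.g. 4 jobs at once
>                 Map<String, Future<Long>> results = new HashMap<String, Future<Long>>();
>                 for (final String table : args) {
>                     results.put(table, pool.submit(new Callable<Long>() {
>                         public Long call() throws Exception {
>                             return countRows(table);
>                         }
>                     }));
>                 }
>                 for (Map.Entry<String, Future<Long>> e : results.entrySet()) {
>                     System.out.println(e.getKey() + "\t" + e.getValue().get());
>                 }
>                 pool.shutdown();
>             }
>         }
>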
>         Second, I could distribute the process.  I could find as many
> machines as can successfully talk to the desired cluster, give each a
> subset of tables to work on, and then combine the results afterward.
>
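> For the split itself, a deterministic hash might be enough; something like
> this hypothetical helper, where worker i of n claims every table whose
> hash lands on i:
>
>         import java.util.ArrayList;
>         import java.util.List;
>
>         static List<String> tablesForWorker(List<String> allTables,
>                                             int worker, int n) {
>             List<String> mine = new ArrayList<String>();
>             for (String t : allTables) {
>                 // Mask the sign bit so the modulus is non-negative.
>                 if ((t.hashCode() & Integer.MAX_VALUE) % n == worker) {
>                     mine.add(t);
>                 }
>             }
>             return mine;
>         }
>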
>         Third, I could combine both of the above approaches and run a
> distributed set of multithreaded processes to execute the MapReduce jobs in
> parallel.
>
>         Although it seems to have been asked and answered many times, I
> will ask once again.  Without the need to change our current configurations
> or restart the clusters, is there a faster approach to obtaining row counts?
> FYI, my cache size for the Scan is set to 1000.  I have experimented with
> different numbers, but nothing made a noticeable difference.  Any advice or
> feedback would be greatly appreciated!
>
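> For reference, the scan cache size is set via the standard
> hbase.client.scanner.caching property on the job configuration before the
> RowCounter job is created; roughly like this, slotted into the snippet
> above:
>
>         Configuration conf = HBaseConfiguration.create();
>         // Rows fetched per scanner RPC in each map task.
>         conf.setInt("hbase.client.scanner.caching", 1000);
>         Job job = RowCounter.createSubmittableJob(
>                         conf, new String[] { tableName });
>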
> Thanks,
> Birch