HBase >> mail # user >> HBase Table Row Count Optimization - A Solicitation For Help


Re: HBase Table Row Count Optimization - A Solicitation For Help
How many nodes do you have in your cluster?

When counting rows, what other load would be placed on the cluster?

Which HBase version are you currently using or planning to use?

Thanks
On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
[EMAIL PROTECTED]> wrote:

>         After reading the documentation and scouring the mailing list
> archives, I understand there is no real support for fast row counting in
> HBase unless you build some sort of tracking logic into your code.  In our
> case, we do not have such logic, and have massive amounts of data already
> persisted.  I am running into the issue of very long execution of the
> RowCounter MapReduce job against very large tables (we estimate that many
> hold multiple billions of rows).  I understand why this issue exists and am slowly
> accepting it, but I am hoping I can solicit some possible ideas to help
> speed things up a little.
>
>         My current task is to provide total row counts on about 600
> tables, some extremely large, some not so much.  Currently, I have a
> process that executes the MapReduce job in-process like so:
>
>                         Job job = RowCounter.createSubmittableJob(
>                                         ConfigManager.getConfiguration(),
>                                         new String[]{tableName});
>                         // Blocks until the job finishes; 'true' prints progress.
>                         boolean waitForCompletion = job.waitForCompletion(true);
>                         Counters counters = job.getCounters();
>                         Counter rowCounter = counters.findCounter(
>                                         hbaseadminconnection.Counters.ROWS);
>                         return rowCounter.getValue();
>
>         At the moment, each MapReduce job is executed in serial order, so
> counting one table at a time.  For the current implementation of this whole
> process, as it stands right now, my rough timing calculations indicate that
> fully counting all the rows of these 600 tables will take anywhere from
> 11 to 22 days.  This is not what I consider a desirable timeframe.
>
>         I have considered three alternative approaches to speed things up.
>
>         First, since the application is not heavily CPU bound, I could use
> a ThreadPool and execute multiple MapReduce jobs at the same time looking
> at different tables.  I have never done this, so I am unsure if this would
> cause any unanticipated side effects.
>
>         Second, I could distribute the processes.  I could find as many
> machines as can successfully talk to the desired cluster, give
> each of them a subset of tables to work on, and then combine the results
> afterward.
>
>         Third, I could combine both of the above approaches and run a
> distributed set of multithreaded processes to execute the MapReduce jobs in
> parallel.
>
>         Although it seems to have been asked and answered many times, I
> will ask once again.  Without the need to change our current configurations
> or restart the clusters, is there a faster approach to obtain row counts?
>  FYI, my cache size for the Scan is set to 1000.  I have experimented with
> different numbers, but nothing made a noticeable difference.  Any advice or
> feedback would be greatly appreciated!
>
> Thanks,
> Birch
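
For reference, the first alternative in the quoted message (a thread pool that runs several counting jobs at once) could look roughly like the sketch below. It is only an illustration under assumptions: `ParallelTableCounter`, `countAll`, and `maxConcurrentJobs` are hypothetical names that do not come from HBase or from the original code, and the per-table counter is passed in as a function so the sketch runs without a cluster. In real use that function would wrap the `RowCounter.createSubmittableJob(...)` call shown in the quoted message. Whether this actually shortens the total time depends on how much spare capacity the cluster has.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

// Hypothetical helper (not part of HBase): runs one row count per table
// on a bounded thread pool and gathers the results.
final class ParallelTableCounter {

    static Map<String, Long> countAll(List<String> tableNames,
                                      ToLongFunction<String> counter,
                                      int maxConcurrentJobs)
            throws InterruptedException, ExecutionException {
        // The pool size caps how many counting jobs run at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentJobs);
        try {
            // Submit one task per table.
            List<Future<Long>> futures = new ArrayList<>();
            for (String table : tableNames) {
                Callable<Long> task = () -> counter.applyAsLong(table);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so the map follows the input order.
            Map<String, Long> counts = new LinkedHashMap<>();
            for (int i = 0; i < tableNames.size(); i++) {
                counts.put(tableNames.get(i), futures.get(i).get());
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stub counter for demonstration only; a real one would submit the
        // RowCounter job for the table and return its ROWS counter value.
        ToLongFunction<String> stub = name -> (long) name.length();
        System.out.println(countAll(List.of("t1", "t2", "t3"), stub, 2));
    }
}
```

The same `countAll` shape would also cover the second alternative: each machine could run it over its own subset of table names, and the resulting maps could then be merged.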