Re: HBase Table Row Count Optimization - A Solicitation For Help
In 0.94, we have AggregateImplementation, an endpoint coprocessor, which
implements getRowNum().

Example is in AggregationClient.java

Cheers
On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> From your numbers below you have about 26k regions, thus each region is
> about 545tb/26k = 20gb. Good.
>
> How many mappers are you running?
> And just to rule out the obvious, the M/R is running on the cluster and
> not locally, right? (it will default to a local runner when it cannot use
> the M/R cluster).
>
> Some back-of-the-envelope calculations tell me that, assuming 1 GbE network
> cards, the best you can expect for 110 machines to map through this data is
> about 11h (so way faster than what you see).
> (545 TB / (110 * 1/8 GB/s) ~ 40 ks ~ 11h)
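
The arithmetic in this message can be checked with a short script. This is only a sketch of the estimate as quoted: it assumes decimal units (1 TB = 10^12 bytes), treats the 545 TB "DFS Used" figure as the amount of data to scan, and assumes each of the 110 machines can sustain the full 1 Gbit/s. The function names are made up for illustration. The last figure extrapolates the "map 10%" after about 10 hours reported elsewhere in the thread.

```python
# Back-of-the-envelope check of the figures quoted in this thread.
# Assumptions: decimal units, and every machine sustains a full 1 Gbit/s link.
TB = 1000 ** 4
GB = 1000 ** 3


def region_size_gb(total_bytes, region_count):
    """Average data per region, in GB."""
    return total_bytes / region_count / GB


def full_scan_seconds(total_bytes, machine_count, bytes_per_second):
    """Time to move all data once if every machine reads at bytes_per_second."""
    return total_bytes / (machine_count * bytes_per_second)


def projected_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation of total job time from progress so far."""
    return elapsed_hours / fraction_done


data_bytes = 545 * TB
nic_bytes_per_second = GB / 8  # 1 Gbit/s expressed in bytes per second

scan_hours = full_scan_seconds(data_bytes, 110, nic_bytes_per_second) / 3600
observed_hours = projected_total_hours(10, 0.10)

print(f"average region size: {region_size_gb(data_bytes, 26_000):.1f} GB")
print(f"network-bound scan:  {scan_hours:.1f} h")
print(f"projected job time:  {observed_hours:.0f} h")
```

Under these assumptions the network-bound bound is roughly 11 hours, while the observed progress extrapolates to about 100 hours, which is the gap the message points at.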
>
>
> We should really add a rowcounting coprocessor to HBase and allow using it
> via M/R.
>
> -- Lars
>
>
>
> ________________________________
>  From: James Birchfield <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, September 20, 2013 5:09 PM
> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>
>
> I did not implement accurate timing, but the current table being counted
> has been running for about 10 hours, and the log is estimating the map
> portion at 10%
>
> 2013-09-20 23:40:24,099 INFO  [main] Job                            :  map 10% reduce 0%
>
> So a loooong time.  Like I mentioned, we have billions, if not trillions
> of rows potentially.
>
> Thanks for the feedback on the approaches I mentioned.  I was not sure if
> they would have any effect overall.
>
> I will look further into coprocessors.
>
> Thanks!
> Birch
> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[EMAIL PROTECTED]>
> wrote:
>
> > How long does the RowCounter job take to finish for the largest table on
> > your cluster?
> >
> > Just curious.
> >
> > On your options:
> >
> > 1. Not worth it probably - you may overload your cluster
> > 2. Not sure this one differs from 1. Looks the same to me but more
> > complex.
> > 3. The same as 1 and 2
> >
> > Counting rows efficiently can be done if you sacrifice some accuracy:
> >
> >
> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
> >
> > Yeah, you will need coprocessors for that.
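
To make the accuracy-for-speed trade concrete, here is a minimal sketch of linear probabilistic counting, one technique from the family of approximate counters. It is not taken from this thread and is not an HBase API: the `LinearCounter` class and its methods are hypothetical names, and nothing here runs inside a coprocessor. The idea is that each region could fill a small fixed-size bitmap from its row keys, and the client could merge the bitmaps and estimate the total.

```python
import hashlib
import math


class LinearCounter:
    """Approximate distinct-key counter backed by a fixed-size bitmap (hypothetical helper)."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def add(self, key: bytes):
        # Hash the key to one bit position and set that bit.
        digest = hashlib.sha1(key).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[index // 8] |= 1 << (index % 8)

    def merge(self, other):
        # Bitwise OR combines two counters built with the same num_bits.
        if other.num_bits != self.num_bits:
            raise ValueError("counters must have the same size")
        for i, byte in enumerate(other.bits):
            self.bits[i] |= byte

    def estimate(self):
        # Linear counting estimate: n ~ -m * ln(fraction of bits still zero).
        set_bits = sum(bin(byte).count("1") for byte in self.bits)
        zero_bits = self.num_bits - set_bits
        if zero_bits == 0:
            raise ValueError("bitmap saturated; use a larger num_bits")
        return -self.num_bits * math.log(zero_bits / self.num_bits)


# Two partial counters stand in for two regions; the client merges them.
region_a = LinearCounter(65536)
region_b = LinearCounter(65536)
for i in range(5000):
    region_a.add(f"row-{i}".encode())
for i in range(5000, 10000):
    region_b.add(f"row-{i}".encode())
region_a.merge(region_b)
print(f"estimated rows: {region_a.estimate():.0f} (true count: 10000)")
```

The bitmap has to be sized for the expected cardinality, so at billions of rows a logarithmic sketch such as HyperLogLog would be the more realistic choice; the merge-then-estimate shape stays the same.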
> >
> > Best regards,
> > Vladimir Rodionov
> > Principal Platform Engineer
> > Carrier IQ, www.carrieriq.com
> > e-mail: [EMAIL PROTECTED]
> >
> > ________________________________________
> > From: James Birchfield [[EMAIL PROTECTED]]
> > Sent: Friday, September 20, 2013 3:50 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
> >
> > Hadoop 2.0.0-cdh4.3.1
> >
> > HBase 0.94.6-cdh4.3.1
> >
> > 110 servers, 0 dead, 238.2364 average load
> >
> > Some other info, not sure if it helps or not.
> >
> > Configured Capacity: 1295277834158080 (1.15 PB)
> > Present Capacity: 1224692609430678 (1.09 PB)
> > DFS Remaining: 624376503857152 (567.87 TB)
> > DFS Used: 600316105573526 (545.98 TB)
> > DFS Used%: 49.02%
> > Under replicated blocks: 0
> > Blocks with corrupt replicas: 1
> > Missing blocks: 0
> >
> > It is hitting a production cluster, but I am not really sure how to
> > calculate the load placed on the cluster.
> > On Sep 20, 2013, at 3:19 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> >> How many nodes do you have in your cluster ?
> >>
> >> When counting rows, what other load would be placed on the cluster ?
> >>
> >> What is the HBase version you're currently using / planning to use ?
> >>
> >> Thanks
> >>
> >>
> >> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
> >> [EMAIL PROTECTED]> wrote:
> >>
> >>>       After reading the documentation and scouring the mailing list
> >>> archives, I understand there is no real support for fast row counting in
> >>> HBase unless you build some sort of tracking logic into your code.  In our
> >>> case, we do not have such logic, and have massive amounts of data