Re: HBase Table Row Count Optimization - A Solicitation For Help
Ted Yu 2013-09-21, 01:17
In 0.94, we have AggregateImplementation, an endpoint coprocessor, which
implements getRowNum().

An example is in AggregationClient.java.

Cheers
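
A minimal sketch of calling that endpoint through AggregationClient (HBase 0.94 client API; the table name and column family are placeholders, and it assumes AggregateImplementation is registered as a coprocessor on the table):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    // rowCount() declares "throws Throwable" in 0.94, hence the signature.
    public static long countRows() throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        AggregationClient aggregationClient = new AggregationClient(conf);
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf"));          // placeholder column family
        // Fans out to every region in parallel via the AggregateImplementation
        // endpoint and sums the per-region counts on the client.
        return aggregationClient.rowCount(
            Bytes.toBytes("my_table"),                // placeholder table name
            new LongColumnInterpreter(), scan);
    }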
On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> From your numbers below you have about 26k regions, thus each region is
> about 545 TB / 26k ≈ 21 GB. Good.
>
> How many mappers are you running?
> And just to rule out the obvious, the M/R is running on the cluster and
> not locally, right? (it will default to a local runner when it cannot use
> the M/R cluster).
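
One quick way to rule that out is to print the relevant job-submission properties; a sketch (which property is authoritative depends on whether the cluster runs MRv1 or YARN):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    // MRv1: "local" means the in-process LocalJobRunner (a single mapper);
    // anything else should be a JobTracker address.
    System.out.println(conf.get("mapred.job.tracker"));
    // Hadoop 2 / YARN: "local" vs "yarn".
    System.out.println(conf.get("mapreduce.framework.name"));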
>
> Some back-of-the-envelope calculations tell me that, assuming 1 GbE network
> cards, the best you can expect for 110 machines to map through this data is
> about 11 h, so way faster than what you are seeing:
> (545 TB / (110 × 1/8 GB/s) ≈ 40,000 s ≈ 11 h)
>
>
> We should really add a row-counting coprocessor to HBase and allow using it
> via M/R.
>
> -- Lars
>
>
>
> ________________________________
>  From: James Birchfield <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, September 20, 2013 5:09 PM
> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>
>
> I did not implement accurate timing, but the current table being counted
> has been running for about 10 hours, and the log is estimating the map
> portion at 10%:
>
> 2013-09-20 23:40:24,099 INFO  [main] Job                            :  map
> 10% reduce 0%
>
> So a loooong time.  Like I mentioned, we potentially have billions, if not
> trillions, of rows.
>
> Thanks for the feedback on the approaches I mentioned.  I was not sure if
> they would have any effect overall.
>
> I will look further into coprocessors.
>
> Thanks!
> Birch
> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[EMAIL PROTECTED]>
> wrote:
>
> > How long does it take the RowCounter job on your largest table to finish
> on your cluster?
> >
> > Just curious.
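
For reference, that stock job ships with HBase and is typically launched like this (the table name is a placeholder):

    hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'my_table'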
> >
> > On your options:
> >
> > 1. Probably not worth it - you may overload your cluster
> > 2. Not sure this one differs from 1. Looks the same to me but more
> complex.
> > 3. The same as 1 and 2
> >
> > Counting rows in an efficient way can be done if you sacrifice some
> accuracy:
> >
> >
> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
> >
> > Yeah, you will need coprocessors for that.
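
The linked article describes probabilistic sketches such as HyperLogLog. A minimal illustration of the idea using the open-source stream-lib library (the library choice and the 1% target error are assumptions, not from this thread; rowKeys stands in for whatever iterator supplies the keys, e.g. a region scan):

    import com.clearspring.analytics.stream.cardinality.HyperLogLog;

    // Approximate counting: feed each row key into a fixed-size sketch
    // instead of keeping an exact counter; memory stays constant no
    // matter how many rows are seen.
    public static long approximateCount(Iterable<byte[]> rowKeys) {
        HyperLogLog hll = new HyperLogLog(0.01);   // ~1% relative standard error
        for (byte[] rowKey : rowKeys) {
            hll.offer(rowKey);
        }
        return hll.cardinality();
    }

Run one sketch per region inside a coprocessor and merge the sketches on the client, and you get an approximate row count in near-constant memory.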
> >
> > Best regards,
> > Vladimir Rodionov
> > Principal Platform Engineer
> > Carrier IQ, www.carrieriq.com
> > e-mail: [EMAIL PROTECTED]
> >
> > ________________________________________
> > From: James Birchfield [[EMAIL PROTECTED]]
> > Sent: Friday, September 20, 2013 3:50 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
> >
> > Hadoop 2.0.0-cdh4.3.1
> >
> > HBase 0.94.6-cdh4.3.1
> >
> > 110 servers, 0 dead, 238.2364 average load
> >
> > Some other info, not sure if it helps or not.
> >
> > Configured Capacity: 1295277834158080 (1.15 PB)
> > Present Capacity: 1224692609430678 (1.09 PB)
> > DFS Remaining: 624376503857152 (567.87 TB)
> > DFS Used: 600316105573526 (545.98 TB)
> > DFS Used%: 49.02%
> > Under replicated blocks: 0
> > Blocks with corrupt replicas: 1
> > Missing blocks: 0
> >
> > It is hitting a production cluster, but I am not really sure how to
> calculate the load placed on the cluster.
> > On Sep 20, 2013, at 3:19 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> >> How many nodes do you have in your cluster ?
> >>
> >> When counting rows, what other load would be placed on the cluster ?
> >>
> >> What is the HBase version you're currently using / planning to use ?
> >>
> >> Thanks
> >>
> >>
> >> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
> >> [EMAIL PROTECTED]> wrote:
> >>
> >>>       After reading the documentation and scouring the mailing list
> >>> archives, I understand there is no real support for fast row counting
> in
> >>> HBase unless you build some sort of tracking logic into your code.  In
> our
> >>> case, we do not have such logic, and have massive amounts of data