HBase >> mail # user >> HBase Table Row Count Optimization - A Solicitation For Help


Re: HBase Table Row Count Optimization - A Solicitation For Help
bq. FirstKeyFilter *should* be faster since it only grabs the first KV pair.

Minor correction: FirstKeyFilter above should be FirstKeyOnlyFilter
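For concreteness, a rough sketch of the scan setup discussed in this thread (high caching, FirstKeyOnlyFilter, plus the KeyOnlyFilter suggested further down), using the 0.94-era HBase client API. The caching value and the FilterList combination are illustrative choices, not taken from the original job:

```java
import java.util.Arrays;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

// Scan tuned for row counting: return only the first KV of each row,
// stripped of its value bytes, and fetch many rows per RPC.
Scan scan = new Scan();
scan.setCaching(10000);       // rows per RPC; raise further if cells are tiny
scan.setCacheBlocks(false);   // a full scan would only churn the block cache
scan.setFilter(new FilterList(Arrays.<Filter>asList(
        new FirstKeyOnlyFilter(),   // one KV per row
        new KeyOnlyFilter())));     // drop the value bytes
// then e.g. TableMapReduceUtil.initTableMapperJob(tableName, scan, ...)
```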
On Fri, Sep 20, 2013 at 5:53 PM, James Birchfield <[EMAIL PROTECTED]> wrote:

> Thanks for the info.
>
> Right now the MapReduce Scan uses the FirstKeyOnlyFilter.  From what I
> have read in the javadoc, FirstKeyFilter *should* be faster since it only
> grabs the first KV pair.
>
> I will play around with setting the caching size to a much higher number
> and see how it performs.  I do not think I have too much wiggle room to
> modify our cluster configurations, but will see what I can do.
>
> Thanks!
>
> Birch
> On Sep 20, 2013, at 5:39 PM, Bryan Beaudreault <[EMAIL PROTECTED]> wrote:
>
> > If your cells are extremely small, try setting the caching even higher
> > than 10k.  You want to strike a balance between MBs of response size
> > and number of calls needed, leaning towards larger response sizes as
> > far as your system can handle (account for RS max memory, and memory
> > available to your mappers).
> >
> > You could use the KeyOnlyFilter to further limit the sizes of the
> > responses transferred.
> >
> > Another thing that may help would be increasing your block size.  This
> > would speed up sequential reads but slow down random access.  It would
> > be a matter of making the config change and then running a major
> > compaction to re-write the data.
> >
> > We constantly run multiple MR jobs (often on the order of tens) against
> > the same HBase cluster and don't often see issues.  They are not full
> > table scans, but they do often overlap.  I think it would be worth at
> > least attempting to run multiple jobs at once.
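A rough sketch of the block-size change plus major compaction described above, using the 0.94 HBaseAdmin API; the table name "my_table", family "cf", and the 256 KB block size are placeholders, not values from the thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Bump the block size for one column family, then major-compact so the
// store files are rewritten with the new size.
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

HTableDescriptor table = admin.getTableDescriptor(Bytes.toBytes("my_table"));
HColumnDescriptor family = table.getFamily(Bytes.toBytes("cf"));
family.setBlocksize(256 * 1024);          // default is 64 KB

admin.disableTable("my_table");
admin.modifyColumn("my_table", family);
admin.enableTable("my_table");
admin.majorCompact("my_table");           // asynchronous; rewrites in background
```

Note that majorCompact only queues the compaction; the actual rewrite completes in the background.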
> >
> >
> >
> >
> > On Fri, Sep 20, 2013 at 8:09 PM, James Birchfield <[EMAIL PROTECTED]> wrote:
> >
> >> I did not implement accurate timing, but the current table being counted
> >> has been running for about 10 hours, and the log is estimating the map
> >> portion at 10%
> >>
> >> 2013-09-20 23:40:24,099 INFO  [main] Job : map 10% reduce 0%
> >>
> >> So a loooong time.  Like I mentioned, we have billions, if not trillions
> >> of rows potentially.
> >>
> >> Thanks for the feedback on the approaches I mentioned.  I was not
> >> sure if they would have any effect overall.
> >>
> >> I will look further into coprocessors.
> >>
> >> Thanks!
> >> Birch
> >> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:
> >>
> >>> How long does the RowCounter job take to finish on the largest
> >>> table on your cluster?
> >>>
> >>> Just curious.
> >>>
> >>> On your options:
> >>>
> >>> 1. Probably not worth it - you may overload your cluster.
> >>> 2. Not sure this one differs from 1.  Looks the same to me, but more
> >>> complex.
> >>> 3. The same as 1 and 2.
> >>>
> >>> Counting rows in an efficient way can be done if you sacrifice some
> >>> accuracy:
> >>>
> >>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
> >>>
> >>> Yeah, you will need coprocessors for that.
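The linked article covers probabilistic counters such as HyperLogLog. As a rough illustration of the idea only (standalone Java, not tied to HBase or coprocessors) — the register count, the SplitMix64-style hash, and the bias constant are simplified choices for the sketch:

```java
/** Minimal HyperLogLog sketch: 2^p registers, each remembering the longest
 *  run of leading zero bits seen among hashes routed to that register. */
class HyperLogLog {
    private final int p;       // bits used to pick a register
    private final int m;       // number of registers = 2^p
    private final byte[] regs;

    HyperLogLog(int p) {
        this.p = p;
        this.m = 1 << p;
        this.regs = new byte[m];
    }

    // SplitMix64 finalizer: a cheap, well-mixing 64-bit hash for the sketch
    private static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    void add(long item) {
        long h = hash(item);
        int idx = (int) (h >>> (64 - p));     // top p bits pick a register
        long rest = h << p;                   // remaining bits, left-aligned
        int rank = Long.numberOfLeadingZeros(rest) + 1;
        if (rank > 64 - p + 1) rank = 64 - p + 1;
        if (rank > regs[idx]) regs[idx] = (byte) rank;
    }

    long estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : regs) {
            sum += 1.0 / (1L << r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m);
        double e = alpha * m * m / sum;       // raw harmonic-mean estimate
        if (e <= 2.5 * m && zeros > 0)        // small-range correction
            e = m * Math.log((double) m / zeros);
        return Math.round(e);
    }
}
```

With p = 12 (4096 registers, 4 KB of state) the standard error is roughly 1.6%, which is the kind of accuracy trade-off the article describes for counting billions of rows.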
> >>>
> >>> Best regards,
> >>> Vladimir Rodionov
> >>> Principal Platform Engineer
> >>> Carrier IQ, www.carrieriq.com
> >>> e-mail: [EMAIL PROTECTED]
> >>>
> >>> ________________________________________
> >>> From: James Birchfield [[EMAIL PROTECTED]]
> >>> Sent: Friday, September 20, 2013 3:50 PM
> >>> To: [EMAIL PROTECTED]
> >>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
> >>>
> >>> Hadoop 2.0.0-cdh4.3.1
> >>>
> >>> HBase 0.94.6-cdh4.3.1
> >>>
> >>> 110 servers, 0 dead, 238.2364 average load
> >>>
> >>> Some other info, not sure if it helps or not.
> >>>
> >>> Configured Capacity: 1295277834158080 (1.15 PB)
> >>> Present Capacity: 1224692609430678 (1.09 PB)
> >>> DFS Remaining: 624376503857152 (567.87 TB)
> >>> DFS Used: 600316105573526 (545.98 TB)
> >>> DFS Used%: 49.02%
> >>> Under replicated blocks: 0