HBase >> mail # user >> Results from a Map/Reduce


RE: Results from a Map/Reduce
If you do aggregation, your queries will most likely be well under a second.  The aggregates should reduce the amount of data that needs to be read by several orders of magnitude, no?
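A back-of-the-envelope check of that claim, with assumed numbers (say 5 million raw log rows for a popular customer, rolled up into hourly buckets over 30 days):

```java
public class RollupSavings {
    // Rows the client must read after hourly aggregation: one per hour bucket.
    public static long bucketRows(int days) {
        return days * 24L;
    }

    public static void main(String[] args) {
        long rawRows = 5_000_000L;        // assumed raw log rows for one customer
        long rollupRows = bucketRows(30); // 30 days -> 720 bucket rows
        System.out.println("reduction factor: " + (rawRows / rollupRows));
        // prints "reduction factor: 6944"
    }
}
```

Roughly a 7000x reduction under these assumptions, which is indeed several orders of magnitude less data to scan at query time.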

> -----Original Message-----
> From: Peter Haidinyak [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 17, 2010 1:43 PM
> To: [EMAIL PROTECTED]
> Subject: RE: Results from a Map/Reduce
>
> So the idea is to aggregate the final result to an HBase Table and then from
> the client query that table. I'm going to have to find a quicker method.
> Currently on my small three node cluster with 100million rows it takes a
> couple of minutes to do a scan that brings back several million rows. My boss
> wants the query to be in the 'less than five second' range.
>
> Thanks
>
> -Pete
>
> -----Original Message-----
> From: Jonathan Gray [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 17, 2010 1:19 PM
> To: [EMAIL PROTECTED]
> Subject: RE: Results from a Map/Reduce
>
> If there's a customer waiting for the query, then you wouldn't want to have
> them wait for an MR job.
>
> So what you're saying is you want to change this from on-demand scans to
> using MapReduce to aggregate roll-ups ahead of time and serve those?
>
> In that case, your MR job doesn't need one final output, right?  You could do
> the Map over the entire table (or start/stop rows depending on schema) with
> the appropriate filters.  You would output (customerid + hour bucket) as
> the key and 1 for the value.  You'd get a reduce for each customerid/hour
> bucket and would write that to HBase.
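The (customerid + hour bucket) key is the crux of that scheme. A minimal sketch of how such a composite row key might be built — the class name, separator, and bucket format here are illustrative assumptions, not from the thread:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourBucketKey {
    // Truncate the event timestamp to the hour (UTC) and join it to the
    // customer id, e.g. "cust42#2010121720" -- one row per customer per hour.
    private static final DateTimeFormatter HOUR_FMT =
        DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);

    public static String hourBucketKey(String customerId, long epochMillis) {
        return customerId + "#" + HOUR_FMT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // The mapper would emit this as the key with a value of 1;
        // the reducer sums the 1s and writes the total to HBase.
        System.out.println(hourBucketKey("cust42", 1292617380000L));
        // prints "cust42#2010121720"
    }
}
```

With customerid leading the key, all of one customer's hour buckets are contiguous, so the web tier can serve "hits per hour for Y days" with a short start/stop-row scan instead of reading millions of raw log rows.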
>
> One of the ideas behind coprocessors is you could do the per-customer
> scan/filter/aggregate as a parallel operation inside the RSs (without the
> overhead of MR or cross-JVM) and might be able to increase the number of
> rows you can process within a reasonable amount of time.
>
> Another approach to these kinds of aggregates, if you care about realtime at
> some level, is to use HBase's increment capabilities and a similar hour-
> bucketed schema but updated on demand instead of in batch.
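A sketch of that on-demand pattern, with an in-memory map standing in for the HBase table — in HBase, each recordHit would instead be a single atomic increment against the customer/hour row (names here are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class HitCounter {
    // Stand-in for an HBase counter table keyed by "customerid#yyyyMMddHH".
    // In HBase, recordHit maps to one atomic increment, so counts stay
    // current in realtime with no batch MR job.
    private final ConcurrentHashMap<String, LongAdder> buckets = new ConcurrentHashMap<>();

    public void recordHit(String customerId, String hourBucket) {
        buckets.computeIfAbsent(customerId + "#" + hourBucket, k -> new LongAdder())
               .increment();
    }

    public long hitsFor(String customerId, String hourBucket) {
        LongAdder a = buckets.get(customerId + "#" + hourBucket);
        return a == null ? 0L : a.sum();
    }

    public static void main(String[] args) {
        HitCounter c = new HitCounter();
        c.recordHit("cust42", "2010121713");
        c.recordHit("cust42", "2010121713");
        System.out.println(c.hitsFor("cust42", "2010121713")); // prints "2"
    }
}
```

The trade-off versus the batch approach: one extra write per logged hit, in exchange for query results that are always up to date.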
>
> Yeah, this is a "basic" operation but that only means there are 100 ways to
> implement it :)
>
> JG
>
> > -----Original Message-----
> > From: Peter Haidinyak [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, December 17, 2010 12:13 PM
> > To: [EMAIL PROTECTED]
> > Subject: RE: Results from a Map/Reduce
> >
> > What I have is basically a query on a log table to return the number
> > of hits per hour for customer X for Y days and having the ability to
> > filtering on columns, these are to be displayed in a web page on demand.
> > Currently, using a Scan, with a popular customer I can get back
> > millions of rows to aggregate into 'Hits per hour' buckets. I wanted
> > to push the aggregation back to a Map/Reduce and then have those
> > results available to send back as a web page.
> > This seems like such a basic operation that I am hoping there are
> > 'Best Practices' or examples on how to accomplish this. I would also like a
> pony too.
> > :-)
> >
> > Thanks
> >
> > -Pete
> >
> > -----Original Message-----
> > From: Jonathan Gray [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, December 17, 2010 12:01 PM
> > To: [EMAIL PROTECTED]
> > Subject: RE: Results from a Map/Reduce
> >
> > There's not much in the way of examples for coprocessors besides the
> > implementation of Security.  Check out HBASE-2000 and go from there.
> > If you're fairly new to HBase, then wait a couple months and there
> > should be much better support around Coprocessors.
> >
> > I'm unsure of a way to have a final result returned back to the main()
> > method.  What exactly are you trying to do with this result?
> > Available to you to do what with it?  Could the MR job put the result
> > back into HBase or could your reducer contain the logic you need to use
> with the final result?
> >
> > > -----Original Message-----
> > > From: Peter Haidinyak [mailto:[EMAIL PROTECTED]]
> > > Sent: Friday, December 17, 2010 11:56 AM