HBase user mailing list: Results from a Map/Reduce


RE: Results from a Map/Reduce
If there's a customer waiting for the query, then you wouldn't want to have them wait for an MR job.

So what you're saying is you want to change this from on-demand scans to using MapReduce to aggregate roll-ups ahead of time and serve those?

In that case, your MR job doesn't need one final output, right?  You could do the Map over the entire table (or a start/stop row range, depending on schema) with the appropriate filters.  You would output (customerid + hour bucket) as the key and 1 for the value.  You'd get a reduce for each customerid/hour bucket and would write that to HBase.
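
A minimal sketch of that roll-up job, using the TableMapper/TableReducer classes from org.apache.hadoop.hbase.mapreduce. The table names ("access_log", "hourly_hits"), the "customerId|timestamp" row-key layout, and the "agg:hits" output column are illustrative assumptions, not anything specified in this thread:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class HourlyHitRollup {

  // Emits (customerId + "|" + hour bucket, 1) for every matching log row.
  static class HitMapper extends TableMapper<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context ctx)
        throws IOException, InterruptedException {
      // Assumes row keys of the form "customerId|timestampMillis".
      String key = Bytes.toString(row.get(), row.getOffset(), row.getLength());
      int sep = key.indexOf('|');
      String customerId = key.substring(0, sep);
      long ts = Long.parseLong(key.substring(sep + 1));
      long hourBucket = ts - (ts % 3600000L);  // truncate to the hour
      ctx.write(new Text(customerId + "|" + hourBucket), ONE);
    }
  }

  // Sums the 1s for each customer/hour bucket and writes the total to HBase.
  static class HitReducer
      extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(Bytes.toBytes("agg"), Bytes.toBytes("hits"), Bytes.toBytes(sum));
      ctx.write(null, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hourly-hit-rollup");
    job.setJarByClass(HourlyHitRollup.class);
    Scan scan = new Scan();        // set start/stop rows and filters here
    scan.setCaching(500);
    scan.setCacheBlocks(false);    // don't pollute the block cache from MR
    TableMapReduceUtil.initTableMapperJob("access_log", scan,
        HitMapper.class, Text.class, LongWritable.class, job);
    TableMapReduceUtil.initTableReducerJob("hourly_hits", HitReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The web tier then reads the precomputed "hourly_hits" rows directly instead of scanning millions of raw log rows per request.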

One of the ideas behind coprocessors is that you could do the per-customer scan/filter/aggregate as a parallel operation inside the region servers (without the overhead of MR or cross-JVM calls) and might be able to increase the number of rows you can process within a reasonable amount of time.
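
Coprocessor support was still landing when this was written, but this idea later shipped as the AggregationClient endpoint built on the HBASE-2000 work. Roughly what a server-side row count looks like with that later client API; the table name, column, and "customerId|timestamp" row-key layout are assumptions for illustration, and the AggregateImplementation coprocessor must be loaded on the table:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerRowCount {
  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    AggregationClient aggClient = new AggregationClient(conf);

    // Restrict the scan to one customer's rows, assuming "customerId|ts" keys.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("cust42|"));
    scan.setStopRow(Bytes.toBytes("cust42|~"));   // '~' sorts after the digits
    scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("hit"));

    // Each region server counts its own slice in parallel; only per-region
    // subtotals cross the wire, not the raw rows.
    long rows = aggClient.rowCount(TableName.valueOf("access_log"),
        new LongColumnInterpreter(), scan);
    System.out.println("rows for cust42: " + rows);
  }
}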

Another approach to these kinds of aggregates, if you care about realtime at some level, is to use HBase's increment capabilities and a similar hour-bucketed schema but updated on demand instead of in batch.
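
A sketch of that increment-based variant, using the atomic incrementColumnValue call from the client API of the time; the "hourly_hits" table and "agg:hits" counter column are made-up names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class HitCounter {
  private final HTable table;

  public HitCounter() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    table = new HTable(conf, "hourly_hits");
  }

  // Called once per logged hit, at write time instead of in a batch job.
  public void recordHit(String customerId, long timestampMs) throws Exception {
    long hourBucket = timestampMs - (timestampMs % 3600000L);
    byte[] row = Bytes.toBytes(customerId + "|" + hourBucket);
    // Atomic server-side increment: no client-side read-modify-write race.
    table.incrementColumnValue(row, Bytes.toBytes("agg"),
        Bytes.toBytes("hits"), 1L);
  }
}

Serving the web page is then just a Get or short Scan over the customer's hour buckets, decoding each value with Bytes.toLong().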

Yeah, this is a "basic" operation but that only means there are 100 ways to implement it :)

JG

> -----Original Message-----
> From: Peter Haidinyak [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 17, 2010 12:13 PM
> To: [EMAIL PROTECTED]
> Subject: RE: Results from a Map/Reduce
>
> What I have is basically a query on a log table to return the number of hits per
> hour for customer X for Y days, with the ability to filter on columns;
> these are to be displayed in a web page on demand.
> Currently, using a Scan, with a popular customer I can get back millions of
> rows to aggregate into 'Hits per hour' buckets. I wanted to push the
> aggregation back to a Map/Reduce and then have those results available to
> send back as a web page.
> This seems like such a basic operation that I am hoping there are 'Best
> Practices' or examples on how to accomplish this. I would like a pony too.
> :-)
>
> Thanks
>
> -Pete
>
> -----Original Message-----
> From: Jonathan Gray [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 17, 2010 12:01 PM
> To: [EMAIL PROTECTED]
> Subject: RE: Results from a Map/Reduce
>
> There's not much in the way of examples for coprocessors besides the
> implementation of Security.  Check out HBASE-2000 and go from there.  If
> you're fairly new to HBase, then wait a couple months and there should be
> much better support around Coprocessors.
>
> I'm unsure of a way to have a final result returned back to the main()
> method.  What exactly are you trying to do with this result?  Available to you
> to do what with it?  Could the MR job put the result back into HBase or could
> your reducer contain the logic you need to use with the final result?
>
> > -----Original Message-----
> > From: Peter Haidinyak [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, December 17, 2010 11:56 AM
> > To: [EMAIL PROTECTED]
> > Subject: RE: Results from a Map/Reduce
> >
> > Does that mean that when job.waitForCompletion(true) returns, I have the
> > results from the Reducer(s) available to me? I haven't seen much on
> > coprocessors; can you point me to some examples of their use?
> >
> > Thanks
> > -Pete
> >
> > -----Original Message-----
> > From: Jonathan Gray [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, December 17, 2010 11:13 AM
> > To: [EMAIL PROTECTED]
> > Subject: RE: Results from a Map/Reduce
> >
> > Hey Peter,
> >
> > That System.exit line is nothing important, just the main thread
> > waiting for the tasks to finish before closing.
> >
> > You're interested in having the MR job return a single result?  To do
> > that, you would need to roll-up the processing done in each of your
> > Map tasks into a single Reduce task.  With one reducer, you can have a
> > single point to do the final aggregation of the result.
> >
> > I'm not sure exactly what kind of aggregation you are doing but
> > funneling into a single reducer can range from no problem to don't