HBase, mail # user - multiple puts in reducer?


Re: multiple puts in reducer?
Jacques 2012-03-01, 17:28
The data flow is what matters. The reduce phase is about sorting output.
If you push Puts to HBase, the input to HBase doesn't have to be sorted,
since HBase sorts no matter what. So using a reducer to sort output is
overkill if you're simply putting those same objects into HBase. On the
flip side, if your reducer is doing real work that can't be done in your
mapper and that the HBase client can't do, then go ahead and use the
reducer.
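
A minimal map-only sketch of that pattern (the table name "mytable",
column family "cf", and the tab-separated input layout are all
assumptions for illustration): wire up TableOutputFormat through
initTableReducerJob with a null reducer class and zero reduce tasks,
and let the mappers emit Puts directly.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class MapOnlyLoad {

      // Emits a Put per input line; HBase sorts on the server side, so
      // no reduce-phase sort is needed.
      static class PutMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t", 2);  // "rowkey<TAB>value"
          Put put = new Put(Bytes.toBytes(parts[0]));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
          ctx.write(new ImmutableBytesWritable(put.getRow()), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "map-only-load");
        job.setJarByClass(MapOnlyLoad.class);
        job.setMapperClass(PutMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // A null reducer class only wires up TableOutputFormat; zero
        // reduce tasks makes the job map-only.
        TableMapReduceUtil.initTableReducerJob("mytable", null, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }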

> Reducers are expensive. Running multiple mappers in a job can be cheaper.
Expounding: all reducers by definition have to wait until all mappers are
done before they can actually start running the reduce method (the shuffle
can start before this). If you don't pick your partitioner correctly, a
small number of reducers may end up doing most of the work, making your
job less parallel than you imagined. A simple rule: don't use a reducer
unless you must do a parallel sort of a large amount of data. (Smaller
sorts, e.g. the map-side join, work if one of the join sides fits into
memory.)
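
To make that smaller-sort aside concrete, a hedged sketch of a map-side
join (the DistributedCache file and its tab-separated layout are
assumptions): load the small side into a HashMap in setup() and probe it
from map(), so no reduce-side sort happens at all.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> smallSide = new HashMap<String, String>();

      @Override
      protected void setup(Context ctx) throws IOException {
        // Small side shipped to every task via DistributedCache,
        // one "key<TAB>value" pair per line.
        Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
        BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = in.readLine()) != null) {
          String[] kv = line.split("\t", 2);
          smallSide.put(kv[0], kv[1]);
        }
        in.close();
      }

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] kv = line.toString().split("\t", 2);
        String match = smallSide.get(kv[0]);  // probe the in-memory side
        if (match != null) {
          ctx.write(new Text(kv[0]), new Text(kv[1] + "\t" + match));
        }
      }
    }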
On Wed, Feb 29, 2012 at 5:04 AM, Michel Segel <[EMAIL PROTECTED]> wrote:

> The assertion is that in most cases you shouldn't need one; the rule of
> thumb is that you should have to defend your use of one.
>
> Reducers are expensive. Running multiple mappers in a job can be cheaper.
>
> All I am saying is that you need to rethink your solution if you insist on
> using a reducer.
>
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 28, 2012, at 11:40 AM, Ben Snively <[EMAIL PROTECTED]> wrote:
>
> > Is there an assertion that you would never need to run a reducer when
> > writing to the DB?
> >
> > It seems that there are cases when you would not need one, but the
> > general statement doesn't apply to all use cases.
> >
> > If you were trying to process data where you may have multiple map tasks
> > (or sets of map tasks) output the same key, you could have a case where
> > you need to reduce the data for that key prior to inserting the result
> > into hbase.
> >
> > Am I missing something? To me, the deciding factor is whether the
> > key/values output by the map tasks are the exact values that need to be
> > inserted into HBase, versus multiple values that must be aggregated
> > together before the result is put into the hbase entry.
> >
> > Thanks,
> > Ben
> >
> >
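
A minimal sketch of that aggregate-then-insert case (the table layout
and the summing semantics are hypothetical): a reducer that collapses
all of a key's values into a single Put.

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    // Collapses every value the mappers emitted for a key into one Put.
    public class SumReducer
        extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
          sum += v.get();
        }
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
        ctx.write(new ImmutableBytesWritable(put.getRow()), put);
      }
    }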
> > On Tue, Feb 28, 2012 at 11:20 AM, Michael Segel
> > <[EMAIL PROTECTED]> wrote:
> >
> >> The better question is why would you need a reducer?
> >>
> >> That's a bit cryptic, I understand, but you have to ask yourself when do
> >> you need to use a reducer when you are writing to a database... ;-)
> >>
> >>
> >> Sent from my iPhone
> >>
> >> On Feb 28, 2012, at 10:14 AM, "T Vinod Gupta" <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >>> Mike,
> >>> I didn't understand - why would I not need a reducer in an HBase M/R
> >>> job? There can be cases, right?
> >>> My use case is very similar to Sujee's blog on frequency counting -
> >>> http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/
> >>> So in the reducer, I can do all the aggregations. Is there a better way?
> >>> I can think of another way - to use increments in the map job itself. I
> >>> have to figure out if that's possible though.
> >>>
> >>> thanks
> >>>
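
Increments in the map job are indeed possible. A rough sketch along the
lines of that frequency-counting use case (the table "freq", family
"cf", and qualifier "count" are invented): bump HBase's atomic counters
straight from map(), with no reducer.

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only frequency counting: bump an atomic HBase counter per word
    // instead of shipping counts through a shuffle and reducer. Run with
    // zero reduce tasks and a NullOutputFormat.
    public class IncrementMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private HTable table;

      @Override
      protected void setup(Context ctx) throws IOException {
        table = new HTable(HBaseConfiguration.create(ctx.getConfiguration()), "freq");
      }

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException {
        for (String word : line.toString().split("\\s+")) {
          // Atomic server-side increment; no sorting involved.
          table.incrementColumnValue(Bytes.toBytes(word),
              Bytes.toBytes("cf"), Bytes.toBytes("count"), 1L);
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        table.close();
      }
    }

The trade-off: each increment is a client round-trip, so this swaps
shuffle cost for RPC cost.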
> >>> On Tue, Feb 28, 2012 at 7:44 AM, Michel Segel <[EMAIL PROTECTED]>
> >>> wrote:
> >>>
> >>>> Yes, you can do it.
> >>>> But why do you have a reducer when running an M/R job against HBase?
> >>>>
> >>>> The trick to writing multiple rows: you do it independently of the
> >>>> output from the map() method.
> >>>>
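
One reading of that trick, as a hypothetical sketch (the table "target"
and the comma fan-out are invented): hold your own HTable handle and
put() rows from inside map(), independent of whatever goes through
context.write().

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Writes as many rows per input record as it likes, straight to
    // HBase, independent of anything emitted through context.write().
    public class MultiRowMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private HTable table;

      @Override
      protected void setup(Context ctx) throws IOException {
        table = new HTable(HBaseConfiguration.create(ctx.getConfiguration()), "target");
      }

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException {
        // One input line fans out into several rows.
        for (String part : line.toString().split(",")) {
          Put put = new Put(Bytes.toBytes(part));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("src"), Bytes.toBytes(offset.get()));
          table.put(put);
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        table.close();  // also flushes any buffered writes
      }
    }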
> >>>>
> >>>> Sent from a remote device. Please excuse any typos...
> >>>>
> >>>> Mike Segel
> >>>>
> >>>> On Feb 28, 2012, at 8:34 AM, T Vinod Gupta <[EMAIL PROTECTED]>
> >>>> wrote:
> >>>>
> >>>>> While doing map-reduce on HBase tables, is it possible to do multiple
> >>>>> puts in the reducer? What I want is a way to be able to write multiple