HBase, mail # user - Re: setNumReduceTasks(1)


Re: setNumReduceTasks(1)
Something Something 2010-01-30, 02:32
N could be up to 1000, and the output of the Map job could be about 5 million
rows.  We only want the top 1000 because the rest could be just noise.  Thanks
for your help.

On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[EMAIL PROTECTED]> wrote:

> How big is N?  How big is the output of the Map job?
>
> Alex.
>
> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <
> [EMAIL PROTECTED]> wrote:
>
> > I am sorry, but I forgot to add one important piece of information.
> >
> > I don't want to write just any N rows to the table.  I want to write the
> > *top* N rows - meaning - I want to write the "key" values of the Reducer
> > in descending order.  Does this make sense?  Sorry for the confusion.
> >
> > On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan
> > <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > A possible solution is to emit only N rows from each mapper and then
> > > use 1 reduce task [*] - if the value of N is not very high.
> > > So you end up with at most m * N rows on the reducer instead of the
> > > full input set, and so the limit can be applied more easily.
> > >
> > >
> > > If you are OK with some variance in the number of rows inserted (and
> > > if the value of N is very high), you can do more interesting things
> > > like N/m' rows per mapper and multiple reducers (r), with the
> > > assumption that each reducer will see at least N/r rows, so you can
> > > limit to N/r per reducer.  Of course, there is a possible error that
> > > gets introduced here ...
> > >
> > >
> > > Regards,
> > > Mridul
> > >
> > > [*] Assuming you just want a simple limit - nothing else.
> > > Also note, each mapper might want to emit N rows instead of 'tweaks'
> > > like N/m rows, since it is possible that multiple mappers might have
> > > fewer than N/m rows to emit to begin with!
> > >
> > >
> > >
> > > Something Something wrote:
> > >
> > >> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would
> > >> the Reducer class be instantiated only on one machine.. always?  I mean
> > >> if I have a cluster of say 1 master, 10 workers & 3 zookeepers, is the
> > >> Reducer class guaranteed to be instantiated only on 1 machine?
> > >>
> > >> If the answer is yes, then I will use a static variable as a counter to
> > >> see how many rows have been added to my HBase table so far.  In my use
> > >> case, I want to write only N rows to a table.  Is there a better way to
> > >> do this?  Please let me know.  Thanks.
> > >>
> > >
> > >
> >
>
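
The approach suggested in this thread (each mapper emits only its own top N
keys, then a single reducer keeps the top N of the at most m * N candidates it
receives) can be sketched in plain Java.  This is only an illustration of the
selection logic, not Hadoop or HBase code: the class and method names
(TopNSketch, topN, topNAcrossSplits) are hypothetical, the keys are assumed to
be long values, and each inner list stands in for the input of one mapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: plain Java, no Hadoop/HBase classes involved.
public class TopNSketch {

    // Returns the n largest keys in descending order.
    // A min-heap capped at n entries keeps memory at O(n), so 5 million
    // inputs with n = 1000 never require holding more than 1000 keys.
    static List<Long> topN(Iterable<Long> keys, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Long> heap = new PriorityQueue<>(); // smallest kept key on top
        for (long key : keys) {
            if (heap.size() < n) {
                heap.add(key);
            } else if (key > heap.peek()) {
                heap.poll();   // drop the current smallest
                heap.add(key); // keep the larger key instead
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    // Simulates m mappers (one per split) that each emit their local top n,
    // followed by a single reducer that selects the global top n.
    static List<Long> topNAcrossSplits(List<List<Long>> splits, int n) {
        List<Long> candidates = new ArrayList<>(); // at most m * n entries
        for (List<Long> split : splits) {
            candidates.addAll(topN(split, n));
        }
        return topN(candidates, n);
    }

    public static void main(String[] args) {
        List<List<Long>> splits = Arrays.asList(
                Arrays.asList(5L, 42L, 7L),
                Arrays.asList(99L, 1L),
                Arrays.asList(13L, 64L, 8L, 21L));
        System.out.println(topNAcrossSplits(splits, 3)); // prints [99, 64, 42]
    }
}
```

Because every mapper emits its full local top N (rather than N/m), the result
is exact even when some splits hold fewer than N/m of the largest keys, which
is the caveat raised in the footnote above.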