HBase >> mail # user >> Re: setNumReduceTasks(1)


Re: setNumReduceTasks(1)
N could be up to 1000, and the output from the Map job could be about 5
million.  We only want the top 1000 because the rest of it could be just
noise.  Thanks for your help.
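
A minimal sketch of the "emit only N rows from each mapper, then use 1 reduce task" idea quoted further down, written in plain Java with no Hadoop API calls. Everything here is an assumption made for illustration: the class name TopNSketch, the use of long values as keys, and in-memory lists standing in for the map input splits. Each simulated mapper keeps its local top N in a bounded min-heap, and the single simulated reducer applies the same selection to the at most m * N candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: lists stand in for map input splits.
class TopNSketch {

    // Returns the n largest keys in descending order, holding at most n keys in memory.
    static List<Long> topN(Iterable<Long> keys, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap: smallest kept key at the head
        for (long k : keys) {
            if (heap.size() < n) {
                heap.add(k);
            } else if (n > 0 && k > heap.peek()) {
                heap.poll();   // drop the smallest of the current candidates
                heap.add(k);
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    // "Map" side: every split emits only its local top n.
    // "Reduce" side: one reducer picks the global top n from those candidates.
    static List<Long> topNAcrossSplits(List<List<Long>> splits, int n) {
        List<Long> candidates = new ArrayList<>();
        for (List<Long> split : splits) {
            candidates.addAll(topN(split, n));
        }
        return topN(candidates, n);
    }

    public static void main(String[] args) {
        List<List<Long>> splits = Arrays.asList(
            Arrays.asList(5L, 42L, 7L),
            Arrays.asList(99L, 1L),
            Arrays.asList(13L, 64L, 8L, 21L));
        System.out.println(topNAcrossSplits(splits, 3)); // prints [99, 64, 42]
    }
}
```

The result is exact because any key in the global top N must also be in the top N of its own split. In a real job the same bounded selection would presumably sit in the mapper and reducer classes, but that wiring is not shown here.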

On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[EMAIL PROTECTED]> wrote:

> How big is N?  How big is the output of the Map job?
>
> Alex.
>
> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <[EMAIL PROTECTED]> wrote:
>
> > I am sorry, but I forgot to add one important piece of information.
> >
> > I don't want to write any random N rows to the table.  I want to write the
> > *top* N rows - meaning - I want to write the "key" values of the Reducer in
> > descending order.  Does this make sense?  Sorry for the confusion.
> >
> > On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > A possible solution is to emit only N rows from each mapper and then use
> > > 1 reduce task [*] - if the value of N is not very high.
> > > So you end up with at most m * N rows on the reducer instead of the full
> > > input set - and so the limit is easier to apply.
> > >
> > >
> > > If you are ok with some variance in the number of rows inserted (and if
> > > the value of N is very high), you can do more interesting things like N/m'
> > > rows per mapper - and multiple reducers (r): with the assumption that each
> > > reducer will see at least N/r rows - and so you can limit to N/r per
> > > reducer: of course, there is a possible error that gets introduced here ...
> > >
> > >
> > > Regards,
> > > Mridul
> > >
> > > [*] Assuming you just want a simple limit - nothing else.
> > > Also note, each mapper might want to emit N rows instead of 'tweaks' like
> > > N/m rows, since it is possible that multiple mappers might have fewer than
> > > N/m rows to emit to begin with!
> > >
> > >
> > >
> > > Something Something wrote:
> > >
> > >> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the
> > >> class be instantiated only on one machine.. always?  I mean if I have a
> > >> cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
> > >> guaranteed to be instantiated only on 1 machine?
> > >>
> > >> If the answer is yes, then I will use a static variable as a counter to
> > >> see how many rows have been added to my HBase table so far.  In my use
> > >> case, I want to write only N rows to a table.  Is there a better way to
> > >> do this?  Please let me know.  Thanks.
> > >>
> > >
> > >
> >
>
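
The "possible error" mentioned for the multiple-reducer variant (limit each of r reducers to N/r rows) can be seen in a small plain-Java sketch. The names below are hypothetical, keys are plain long values, and the partitioning rule (key modulo r) is only a stand-in chosen for illustration, not a claim about how any particular job partitions its keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: lists stand in for reducer inputs.
class PerReducerLimitSketch {

    // Largest `limit` keys of the list, in descending order.
    static List<Long> largest(List<Long> keys, int limit) {
        List<Long> sorted = new ArrayList<>(keys);
        sorted.sort(Collections.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    // Spreads keys over r simulated reducers; each keeps at most ceil(n / r) keys.
    static List<Long> approximateTopN(List<Long> keys, int n, int r) {
        List<List<Long>> partitions = new ArrayList<>();
        for (int i = 0; i < r; i++) {
            partitions.add(new ArrayList<>());
        }
        for (long k : keys) {
            // Stand-in partitioner (assumption): key modulo r.
            partitions.get((int) Math.floorMod(k, (long) r)).add(k);
        }
        int perReducer = (n + r - 1) / r; // ceil(n / r)
        List<Long> out = new ArrayList<>();
        for (List<Long> p : partitions) {
            out.addAll(largest(p, perReducer));
        }
        return largest(out, n);
    }

    public static void main(String[] args) {
        List<Long> keys = Arrays.asList(10L, 8L, 6L, 4L, 3L, 1L);
        System.out.println(largest(keys, 2));            // exact top 2: [10, 8]
        System.out.println(approximateTopN(keys, 2, 2)); // per-reducer limit: [10, 3]
    }
}
```

Here the two largest keys land on the same simulated reducer, so the per-reducer limit drops 8 and lets 3 through. With a single reducer, or with each reducer keeping a full N candidates followed by a final merge, this error does not arise.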