Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> how to find top N values using map-reduce ?


Copy link to this message
-
Re: how to find top N values using map-reduce ?
there's one thing i want to clarify that you can use multi-reducers to sort
the data globally and then cat all the parts to get the top n records. The
data in all parts are globally in order.
Then you may find the problem is much easier.
在 2013-2-2 下午3:18,"praveenesh kumar" <[EMAIL PROTECTED]>写道:

> Actually what I am trying to find to top n% of the whole data.
> This n could be very large if my data is large.
>
> Assuming I have uniform rows of equal size and if the total data size
> is 10 GB, using the above mentioned approach, if I have to take top
> 10% of the whole data set, I need 10% of 10GB which could be rows
> worth of 1 GB (roughly) in my mappers.
> I think that would not be possible given my input splits are of
> 64/128/512 MB (based on my block size) or am I making wrong
> assumptions. I can increase the inputsplit size, but is there a better
> way to find top n%.
>
>
> My whole actual problem is to give ranks to some values and then find
> out the top 10 ranks.
>
> I think this context can give more idea about the problem ?
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>
> wrote:
> > Hi,
> >
> > Can you tell more about:
> >  * How big is N
> >  * How big is the input dataset
> >  * How many mappers you have
> >  * Do input splits correlate with the sorting criterion for top N?
> >
> > Depending on the answers, very different strategies will be optimal.
> >
> >
> >
> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]
> >wrote:
> >
> >> I am looking for a better solution for this.
> >>
> >> 1 way to do this would be to find top N values from each mappers and
> >> then find out the top N out of them in 1 reducer.  I am afraid that
> >> this won't work effectively if my N is larger than number of values in
> >> my inputsplit (or mapper input).
> >>
> >> Otherway is to just sort all of them in 1 reducer and then do the cat of
> >> top-N.
> >>
> >> Wondering if there is any better approach to do this ?
> >>
> >> Regards
> >> Praveenesh
> >>
> >
> >
> >
> > --
> > Eugene Kirpichov
> > http://www.linkedin.com/in/eugenekirpichov
> > http://jkff.info/software/timeplotters - my performance visualization
> tools
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB