Hadoop, mail # user - Re: how to find top N values using map-reduce ?


Re: how to find top N values using map-reduce ?
Russell Jurney 2013-02-02, 07:30
Pig. Datafu. 7 lines of code.

https://gist.github.com/4696443
https://github.com/linkedin/datafu
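
For readers who want to see the per-mapper idea from the original question (quoted below) outside of Pig, here is a minimal Python sketch. It is not taken from the linked gist or from DataFu; it only simulates the pattern locally, and the names `mapper_top_n`, `reducer_top_n` and `top_n` are hypothetical. Each "mapper" keeps at most N candidates from its split, and a single "reducer" merges the candidates.

```python
import heapq
from typing import Iterable, List


def mapper_top_n(split: Iterable[int], n: int) -> List[int]:
    """Simulated mapper: return the n largest values of one input split.

    If the split holds fewer than n values, all of them are returned,
    so the final result stays correct even when n exceeds the split size.
    """
    return heapq.nlargest(n, split)


def reducer_top_n(partials: Iterable[List[int]], n: int) -> List[int]:
    """Simulated single reducer: merge the per-split candidates."""
    merged = (value for part in partials for value in part)
    return heapq.nlargest(n, merged)


def top_n(splits: List[List[int]], n: int) -> List[int]:
    """Run the simulated mappers, then the simulated reducer."""
    return reducer_top_n((mapper_top_n(s, n) for s in splits), n)


if __name__ == "__main__":
    # Three made-up input splits of different sizes.
    splits = [[5, 1, 9], [7, 3], [8, 2, 6, 4]]
    print(top_n(splits, 4))
```

When N is larger than a split, the mapper simply passes every value through. The result is still correct, but the reducer then receives close to the full data set, which is the efficiency concern raised in the question.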
On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> Actually, what I am trying to find is the top n% of the whole data.
> This n could be very large if my data is large.
>
> Assuming I have uniform rows of equal size and the total data size
> is 10 GB, then with the approach mentioned above, taking the top
> 10% of the whole data set means holding 10% of 10 GB, roughly 1 GB
> worth of rows, in my mappers.
> I think that would not be possible given that my input splits are
> 64/128/512 MB (based on my block size), or am I making wrong
> assumptions? I can increase the input split size, but is there a better
> way to find the top n%?
>
>
> My actual problem is to assign ranks to some values and then find
> the top 10 ranks.
>
> I think this context gives a better idea of the problem.
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>
> wrote:
> > Hi,
> >
> > Can you tell us more about:
> >  * How big N is
> >  * How big the input dataset is
> >  * How many mappers you have
> >  * Whether input splits correlate with the sorting criterion for top N
> >
> > Depending on the answers, very different strategies will be optimal.
> >
> >
> >
> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:
> >
> >> I am looking for a better solution for this.
> >>
> >> One way to do this would be to find the top N values in each mapper and
> >> then find the top N among them in a single reducer. I am afraid that
> >> this won't work effectively if my N is larger than the number of values in
> >> my input split (or mapper input).
> >>
> >> The other way is to sort all of them in one reducer and then take the
> >> top N.
> >>
> >> Is there a better approach?
> >>
> >> Regards
> >> Praveenesh
> >>
> >
> >
> >
> > --
> > Eugene Kirpichov
> > http://www.linkedin.com/in/eugenekirpichov
> > http://jkff.info/software/timeplotters - my performance visualization tools
>

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
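
The "top n%" variant discussed in the quoted thread needs the total row count before a cutoff can be chosen. A minimal local sketch in Python, assuming an integer percentage and two passes over the data (one to count, one to select), might look like the following. The function names are hypothetical, and nothing here is taken from the linked gist or from DataFu.

```python
import heapq
from typing import List


def count_rows(splits: List[List[int]]) -> int:
    """Pass 1: count all rows across the input splits."""
    return sum(len(split) for split in splits)


def top_percent(splits: List[List[int]], percent: int) -> List[int]:
    """Pass 2: select the top `percent` percent of values.

    `percent` is assumed to be an integer between 0 and 100.
    """
    total = count_rows(splits)
    # Integer ceiling of total * percent / 100, avoiding float rounding.
    k = (total * percent + 99) // 100
    # Each simulated mapper keeps at most k candidates from its split.
    partials = (heapq.nlargest(k, split) for split in splits)
    # A single simulated reducer merges the candidates.
    return heapq.nlargest(k, (v for part in partials for v in part))


if __name__ == "__main__":
    # Ten made-up values spread over three splits.
    splits = [[5, 1, 9], [7, 3], [8, 2, 6, 4, 10]]
    print(top_percent(splits, 20))
```

With 10 rows and 20%, the cutoff is k = 2. At the scale mentioned in the thread (10% of 10 GB), k would be large enough that the single reducer becomes the bottleneck, so a full distributed sort may be the more practical route there.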