

Re: how to find top N values using mapreduce?
praveenesh kumar, 2013-02-02, 07:37
Thanks for that, Russell. Unfortunately I can't use Pig. I need to write
my own MR job. I was wondering how it's usually done in the best way possible.

Regards
Praveenesh

On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
> Pig. Datafu. 7 lines of code.
>
> https://gist.github.com/4696443
> https://github.com/linkedin/datafu
>
>
> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:
>
>> Actually, what I am trying to find is the top n% of the whole data.
>> This n could be very large if my data is large.
>>
>> Assuming I have uniform rows of equal size and the total data size
>> is 10 GB, using the above-mentioned approach, if I have to take the top
>> 10% of the whole data set, I need 10% of 10 GB, which could be rows
>> worth 1 GB (roughly) in my mappers.
>> I think that would not be possible given my input splits are of
>> 64/128/512 MB (based on my block size), or am I making wrong
>> assumptions? I can increase the input split size, but is there a better
>> way to find the top n%?
>>
>>
>> My whole actual problem is to give ranks to some values and then find
>> out the top 10 ranks.
>>
>> I think this context can give more idea about the problem?
>>
>> Regards
>> Praveenesh
>>
>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>
>> wrote:
>> > Hi,
>> >
>> > Can you tell more about:
>> > * How big is N
>> > * How big is the input dataset
>> > * How many mappers you have
>> > * Do input splits correlate with the sorting criterion for top N?
>> >
>> > Depending on the answers, very different strategies will be optimal.
>> >
>> >
>> >
>> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]>
>> > wrote:
>> >
>> >> I am looking for a better solution for this.
>> >>
>> >> One way to do this would be to find the top N values from each mapper and
>> >> then find the top N out of them in 1 reducer. I am afraid that
>> >> this won't work effectively if my N is larger than the number of values in
>> >> my input split (or mapper input).
>> >>
>> >> The other way is to just sort all of them in 1 reducer and then cat the
>> >> top N.
>> >>
>> >> Wondering if there is any better approach to do this?
>> >>
>> >> Regards
>> >> Praveenesh
>> >>
>> >
>> >
>> >
>> > --
>> > Eugene Kirpichov
>> > http://www.linkedin.com/in/eugenekirpichov
>> > http://jkff.info/software/timeplotters - my performance visualization
>> > tools
>
>
>
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
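
For reference, the first approach described in the thread (top N per mapper, then top N of those in a single reducer) can be sketched roughly as below. This is a plain-Python simulation, not Hadoop code: the function names (map_top_n, reduce_top_n, top_n, top_percent) are made up for illustration, each "split" is just an in-memory list, and the n% variant assumes a separate counting pass to turn the percentage into a concrete N.

```python
import heapq
import math


def map_top_n(split, n):
    """Mapper side (simulated): keep only the n largest values of one split."""
    # When the split holds fewer than n values, nlargest simply returns
    # all of them (sorted descending), so a small split is not a problem.
    return heapq.nlargest(n, split)


def reduce_top_n(partial_lists, n):
    """Single reducer (simulated): merge per-split candidates, keep the top n."""
    merged = (value for partial in partial_lists for value in partial)
    return heapq.nlargest(n, merged)


def top_n(splits, n):
    """Run the simulated map phase on every split, then the single reduce."""
    partials = [map_top_n(split, n) for split in splits]
    return reduce_top_n(partials, n)


def top_percent(splits, percent):
    """Top percent% of all values: a counting pass fixes N, then top_n runs."""
    total = sum(len(split) for split in splits)  # stands in for a counting job
    n = math.ceil(total * percent / 100)
    return top_n(splits, n)


if __name__ == "__main__":
    splits = [[5, 1, 9], [7, 3], [8, 2, 6, 4, 10]]
    print(top_n(splits, 4))
    print(top_percent(splits, 30))
```

Note that in this sketch a split smaller than N just forwards all of its values, so the result stays correct; what grows is the amount of data the single reducer must hold, at most N values per split. When N is a sizeable fraction of the data (the 1 GB case above), that reducer likely becomes the bottleneck, and a distributed total-order sort may be the better fit.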