praveenesh kumar
20130202, 05:05
praveenesh kumar
20130202, 07:17
praveenesh kumar
20130202, 07:37
Lake Chang
20130202, 08:12
praveenesh kumar
20130202, 11:07
Niels Basjes
20130202, 12:44
Harsh J
20130202, 18:05


how to find top N values using mapreduce? - praveenesh kumar, 20130202, 05:05
I am looking for a better solution for this.

One way would be to find the top N values in each mapper and then find the top N among those in a single reducer. I am afraid this won't work effectively if N is larger than the number of values in my input split (the mapper's input).

The other way is to sort everything in one reducer and then cat the top N.

Is there a better approach?

Regards,
Praveenesh
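
The first approach in the question (top N per mapper, then one reducer) can be sketched roughly as follows. This is a plain-Python simulation rather than Hadoop API code; the function names `map_top_n` and `reduce_top_n` and the sample splits are made up for illustration.

```python
import heapq

def map_top_n(split, n):
    """Mapper side: keep only the n largest values seen in one input split."""
    heap = []  # min-heap holding the n largest values seen so far
    for value in split:
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # The new value beats the smallest kept value, so swap it in.
            heapq.heapreplace(heap, value)
    return heap  # at most n values, in heap order (not sorted)

def reduce_top_n(partial_results, n):
    """Single reducer: merge the per-split candidates into the global top n."""
    candidates = (v for partial in partial_results for v in partial)
    return heapq.nlargest(n, candidates)

# Hypothetical input: three splits of one data set.
splits = [[5, 1, 9, 3], [7, 8, 2], [6, 4, 10]]
partials = [map_top_n(s, 3) for s in splits]
top3 = reduce_top_n(partials, 3)
```

When N exceeds the size of a split, each mapper simply forwards its whole split. The result is still correct, but nothing is filtered out before the reducer, which is the inefficiency the question worries about.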

Re: how to find top N values using mapreduce? - praveenesh kumar, 20130202, 07:17
Actually, what I am trying to find is the top n% of the whole data set. This n could be very large if my data is large.

Assume uniform rows of equal size and a total data size of 10 GB. With the approach mentioned above, taking the top 10% of the whole data set means keeping 10% of 10 GB, roughly 1 GB worth of rows, in each mapper. I think that is not possible given that my input splits are 64/128/512 MB (based on my block size), or am I making wrong assumptions? I can increase the input split size, but is there a better way to find the top n%?

My actual problem is to assign ranks to some values and then find the top 10 ranks. I think this context gives a better idea of the problem.

Regards,
Praveenesh

On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Can you tell more about:
> * How big is N
> * How big is the input dataset
> * How many mappers you have
> * Do input splits correlate with the sorting criterion for top N?
>
> Depending on the answers, very different strategies will be optimal.
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
> http://jkff.info/software/timeplotters - my performance visualization tools

Re: how to find top N values using mapreduce? - praveenesh kumar, 20130202, 07:37
Thanks for that, Russell. Unfortunately I can't use Pig; I need to write my own MR job. I was wondering how this is usually done in the best possible way.

Regards,
Praveenesh

On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
> Pig. Datafu. 7 lines of code.
>
> https://gist.github.com/4696443
> https://github.com/linkedin/datafu
>
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com

Re: how to find top N values using mapreduce? - Lake Chang, 20130202, 08:12
One thing I want to clarify: you can use multiple reducers to sort the data globally and then cat all the parts to get the top n records. The data across all parts is globally ordered, so you may find the problem much easier.
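
A rough sketch of the multi-reducer sort described above. Parts are only globally ordered if records are routed to reducers by key range (the default hash partitioning does not give that), so the sketch assumes a range partitioner with hand-picked split points. It is a plain-Python simulation; `reducer_for`, `global_sort`, and the boundaries `[4, 8]` are hypothetical.

```python
import bisect
from itertools import chain, islice

def reducer_for(value, boundaries):
    """Range partitioner: reducer 0 receives the largest values.

    boundaries is an ascending list of split points (assumed known).
    """
    return len(boundaries) - bisect.bisect_right(boundaries, value)

def global_sort(values, boundaries):
    """Simulate a multi-reducer sort whose output parts are globally ordered."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        parts[reducer_for(v, boundaries)].append(v)
    # Each reducer sorts only its own key range, largest first.
    return [sorted(p, reverse=True) for p in parts]

values = [5, 1, 9, 3, 7, 8, 2, 6, 4, 10]
parts = global_sort(values, boundaries=[4, 8])

# "cat" the parts in order and stop after the first n records.
top4 = list(islice(chain.from_iterable(parts), 4))
```

Because part 0 holds the largest key range, reading the parts in order and stopping after n records yields the top n without a single reducer ever seeing all the data.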

Re: how to find top N values using mapreduce? - praveenesh kumar, 20130202, 11:07
My actual problem is to rank all the values, then apply logic 1 to the top n% of values and logic 2 to the rest.

1st - Ranking? (I need major suggestions here.)
2nd - Find the top n% of them.

The rest is covered.

Regards,
Praveenesh
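
One possible way to handle the ranking step, assuming the data has already been sorted into globally ordered parts (largest values first): the starting rank of each part is the number of records in all earlier parts, so ranks can be assigned per part without one process reading everything. This is a plain-Python sketch; `logic_1`, `logic_2`, and `cutoff_rank` are placeholders, and tied values simply get consecutive ranks here.

```python
from itertools import accumulate

def assign_ranks(sorted_parts):
    """Yield (rank, value) pairs; rank 1 is the largest value."""
    counts = [len(p) for p in sorted_parts]
    # Offset of a part = total number of records in the parts before it.
    offsets = [0] + list(accumulate(counts))[:-1]
    for offset, part in zip(offsets, sorted_parts):
        for position, value in enumerate(part, start=1):
            yield offset + position, value

def logic_1(value):  # hypothetical stand-in for the "top n%" processing
    return ("top", value)

def logic_2(value):  # hypothetical stand-in for the processing of the rest
    return ("rest", value)

def route(sorted_parts, cutoff_rank):
    """Apply logic_1 to ranks 1..cutoff_rank and logic_2 to everything else."""
    return [logic_1(v) if rank <= cutoff_rank else logic_2(v)
            for rank, v in assign_ranks(sorted_parts)]

parts = [[10, 9, 8], [7, 6, 5, 4], [3, 2, 1]]
ranked = list(assign_ranks(parts))
routed = route(parts, cutoff_rank=2)
```

The per-part record counts are the only global information needed, and they are small enough to collect in a separate counting pass.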

Re: how to find top N values using mapreduce? - Niels Basjes, 20130202, 12:44
My suggestion is to use secondary sort with a single reducer. That way you can easily extract the top N. If you want the top N%, you'll need an additional phase to determine how many records that N% really is.

--
Kind regards,
Niels Basjes
(Sent from mobile)
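
The two phases mentioned above might look like this in outline: a counting pass to turn N% into a record count, then a sorted pass that keeps that many records. This is a plain-Python simulation in which `sorted` stands in for the framework's sort; the function names are made up, and rounding the cutoff up is an assumption.

```python
import math

def count_records(splits):
    """Phase 1: each mapper counts its split and the counts are summed."""
    return sum(len(split) for split in splits)

def cutoff_for_percent(total, percent):
    """Number of records in the top `percent` of `total` (rounded up)."""
    return math.ceil(total * percent / 100)

def top_percent(splits, percent):
    """Phase 2: one reducer sees all values in sorted order, keeps the first k."""
    k = cutoff_for_percent(count_records(splits), percent)
    all_sorted = sorted((v for s in splits for v in s), reverse=True)
    return all_sorted[:k]

splits = [[5, 1, 9, 3], [7, 8, 2], [6, 4, 10]]
top30 = top_percent(splits, 30)
```

The counting pass is cheap because each mapper emits a single number, so the extra phase adds little beyond a second read of the input.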

Re: how to find top N values using mapreduce? - Harsh J, 20130202, 18:05
Note that "one reducer" isn't always the solution. If you know your key space boundaries, consider using a total-order partitioner to scale the job and make use of the nodes on the cluster.

--
Harsh J
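
If the key space boundaries are not known up front, one common way to get them is to take split points from a sample of the keys so that each reducer receives a similar share. The sketch below only illustrates that idea in plain Python; `choose_boundaries` is a made-up helper, not a Hadoop call, and it assumes the sample is much larger than the number of reducers.

```python
import bisect
from collections import Counter

def choose_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 ascending split points from a sample of keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    # Take evenly spaced quantiles of the sample as split points.
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

# Hypothetical key space: the integers 1..100. A real job would use a
# random sample of the input instead of every key.
keys = list(range(1, 101))
boundaries = choose_boundaries(keys, 4)

# Route every key to a partition and check how evenly the load spreads.
sizes = Counter(bisect.bisect_right(boundaries, k) for k in keys)
```

With evenly spread split points, each reducer sorts about the same number of records, which is what allows the sort to scale beyond a single reducer.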
