Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> how to find top N values using map-reduce ?


Copy link to this message
-
Re: how to find top N values using map-reduce ?
Actually what I am trying to find to top n% of the whole data.
This n could be very large if my data is large.

Assuming I have uniform rows of equal size and if the total data size
is 10 GB, using the above mentioned approach, if I have to take top
10% of the whole data set, I need 10% of 10GB which could be rows
worth of 1 GB (roughly) in my mappers.
I think that would not be possible given my input splits are of
64/128/512 MB (based on my block size) or am I making wrong
assumptions. I can increase the inputsplit size, but is there a better
way to find top n%.
My whole actual problem is to give ranks to some values and then find
out the top 10 ranks.

I think this context can give more idea about the problem ?

Regards
Praveenesh

On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Can you tell more about:
>  * How big is N
>  * How big is the input dataset
>  * How many mappers you have
>  * Do input splits correlate with the sorting criterion for top N?
>
> Depending on the answers, very different strategies will be optimal.
>
>
>
> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]>wrote:
>
>> I am looking for a better solution for this.
>>
>> 1 way to do this would be to find top N values from each mappers and
>> then find out the top N out of them in 1 reducer.  I am afraid that
>> this won't work effectively if my N is larger than number of values in
>> my inputsplit (or mapper input).
>>
>> Otherway is to just sort all of them in 1 reducer and then do the cat of
>> top-N.
>>
>> Wondering if there is any better approach to do this ?
>>
>> Regards
>> Praveenesh
>>
>
>
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
> http://jkff.info/software/timeplotters - my performance visualization tools
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB