Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hadoop >> mail # user >> Re: how to find top N values using map-reduce ?


Copy link to this message
-
Re: how to find top N values using map-reduce ?
Maybe look at the pig source to see how it does it?

Russell Jurney http://datasyndrome.com

On Feb 1, 2013, at 11:37 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> Thanks for that Russell. Unfortunately I can't use Pig. Need to write
> my own MR job. I was wondering how its usually done in the best way
> possible.
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>> Pig. Datafu. 7 lines of code.
>>
>> https://gist.github.com/4696443
>> https://github.com/linkedin/datafu
>>
>>
>> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]>wrote:
>>
>>> Actually what I am trying to find to top n% of the whole data.
>>> This n could be very large if my data is large.
>>>
>>> Assuming I have uniform rows of equal size and if the total data size
>>> is 10 GB, using the above mentioned approach, if I have to take top
>>> 10% of the whole data set, I need 10% of 10GB which could be rows
>>> worth of 1 GB (roughly) in my mappers.
>>> I think that would not be possible given my input splits are of
>>> 64/128/512 MB (based on my block size) or am I making wrong
>>> assumptions. I can increase the inputsplit size, but is there a better
>>> way to find top n%.
>>>
>>>
>>> My whole actual problem is to give ranks to some values and then find
>>> out the top 10 ranks.
>>>
>>> I think this context can give more idea about the problem ?
>>>
>>> Regards
>>> Praveenesh
>>>
>>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>
>>> wrote:
>>>> Hi,
>>>>
>>>> Can you tell more about:
>>>> * How big is N
>>>> * How big is the input dataset
>>>> * How many mappers you have
>>>> * Do input splits correlate with the sorting criterion for top N?
>>>>
>>>> Depending on the answers, very different strategies will be optimal.
>>>>
>>>>
>>>>
>>>> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]
>>>> wrote:
>>>>
>>>>> I am looking for a better solution for this.
>>>>>
>>>>> 1 way to do this would be to find top N values from each mappers and
>>>>> then find out the top N out of them in 1 reducer.  I am afraid that
>>>>> this won't work effectively if my N is larger than number of values in
>>>>> my inputsplit (or mapper input).
>>>>>
>>>>> Otherway is to just sort all of them in 1 reducer and then do the cat of
>>>>> top-N.
>>>>>
>>>>> Wondering if there is any better approach to do this ?
>>>>>
>>>>> Regards
>>>>> Praveenesh
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Eugene Kirpichov
>>>> http://www.linkedin.com/in/eugenekirpichov
>>>> http://jkff.info/software/timeplotters - my performance visualization
>>> tools
>>>
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
+
Russell Jurney 2013-02-02, 07:30
+
Eugene Kirpichov 2013-02-02, 06:23
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB