praveenesh kumar, 2013-02-02, 05:05
Eugene Kirpichov, 2013-02-02, 06:23
praveenesh kumar, 2013-02-02, 07:17
Russell Jurney, 2013-02-02, 07:30
praveenesh kumar, 2013-02-02, 07:37
Russell Jurney, 2013-02-02, 08:10
praveenesh kumar, 2013-02-02, 11:07
Niels Basjes, 2013-02-02, 12:44

- Hadoop
- mail # user
- how to find top N values using map-reduce ?

I am looking for a better solution for this.

One way to do this would be to find the top N values in each mapper and then find the top N among those in a single reducer. I am afraid that this won't work effectively if N is larger than the number of values in my input split (the mapper's input).

The other way is to just sort all of them in one reducer and then take the top N from the sorted output.

I am wondering if there is a better approach.

Regards

Praveenesh
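
The first approach described above can be sketched as a small simulation. This is a minimal illustration in plain Python, not Hadoop code: `mapper_top_n` and `reducer_top_n` are hypothetical names, and each inner list stands in for one input split. It also suggests that a split holding fewer than N values is not a correctness problem; that mapper simply forwards everything it has, so the approach loses its pruning benefit rather than its correctness.

```python
import heapq


def mapper_top_n(split_values, n):
    """Return the n largest values seen in one split (all of them if fewer)."""
    return heapq.nlargest(n, split_values)


def reducer_top_n(per_mapper_candidates, n):
    """Merge the candidates from every mapper and keep the global top n."""
    merged = (v for candidates in per_mapper_candidates for v in candidates)
    return heapq.nlargest(n, merged)


# Three hypothetical input splits; the last one holds fewer than n values.
splits = [[5, 17, 3, 9], [12, 1, 20, 8], [7, 2]]
n = 3

candidates = [mapper_top_n(split, n) for split in splits]
top_n = reducer_top_n(candidates, n)
print(top_n)
```

The single reducer only ever sees at most N values per mapper, which is what makes this scheme attractive when N is small compared with a split.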

Hi,

Can you tell us more about:

* How big is N?
* How big is the input dataset?
* How many mappers do you have?
* Do the input splits correlate with the sorting criterion for top N?

Depending on the answers, very different strategies will be optimal.

Actually, what I am trying to find is the top n% of the whole data. This n could be very large if my data is large.

Assuming I have uniform rows of equal size and the total data size is 10 GB, then with the approach mentioned above, taking the top 10% of the whole data set means 10% of 10 GB, which is roughly 1 GB worth of rows in my mappers.

I think that would not be possible given that my input splits are 64/128/512 MB (based on my block size), or am I making wrong assumptions? I can increase the input split size, but is there a better way to find the top n%?

My actual problem is to assign ranks to some values and then find the top 10 ranks.

I think this context gives a better idea of the problem.

Regards

Praveenesh
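
The arithmetic behind this concern can be made concrete. The sketch below assumes the figures from the message (10 GB of uniform rows, top 10%, splits of 64, 128, or 512 MB) and uses a hypothetical helper, `per_mapper_output_mb`; it only computes sizes and is not Hadoop code. The assumption is that, under the local-top-N scheme, a mapper cannot know which of its rows belong to the global top, so it must forward up to N rows, capped by the size of its own split.

```python
import math

TOTAL_MB = 10 * 1024      # 10 GB of input, as in the message
TOP_FRACTION = 0.10       # top 10% of the data


def per_mapper_output_mb(split_mb, total_mb=TOTAL_MB, fraction=TOP_FRACTION):
    """Data one mapper must forward under the local-top-N scheme."""
    target_mb = total_mb * fraction        # size of the global top n%
    return min(split_mb, target_mb)        # a mapper cannot emit more than it read


for split_mb in (64, 128, 512):
    mappers = math.ceil(TOTAL_MB / split_mb)
    forwarded_mb = mappers * per_mapper_output_mb(split_mb)
    print(f"{split_mb} MB splits: {mappers} mappers, "
          f"{forwarded_mb:.0f} MB reaches the reducer")
```

Under these assumptions every mapper forwards its entire split, so the full 10 GB reaches the reducer and the local top-N step prunes nothing. The scheme still returns the right answer; it only pays off when N is much smaller than one split.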

Pig. Datafu. 7 lines of code.

https://gist.github.com/4696443

https://github.com/linkedin/datafu

--

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com

Thanks for that, Russell. Unfortunately I can't use Pig; I need to write my own MR job. I was wondering what the best way to do this usually is.

Regards

Praveenesh

Maybe look at the pig source to see how it does it?

Russell Jurney http://datasyndrome.com

My actual problem is to rank all values and then apply logic 1 to the top n% of values and logic 2 to the rest.

1st - Ranking? (I need major suggestions here.)

2nd - Find the top n% of them.

The rest is then covered.

Regards

Praveenesh

My suggestion is to use secondary sort with a single reducer. That way you can easily extract the top N. If you want to get the top N%, you'll need an additional phase to determine how many records this N% really is.

--

Kind regards,

Niels Basjes

(Sent from mobile)
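
The two phases suggested here can be illustrated with a small simulation. This is a hedged sketch in plain Python rather than a Hadoop job: `count_records`, `single_reducer`, `logic_1`, and `logic_2` are hypothetical names, and the descending sort stands in for the order in which a secondary sort would deliver records to the single reducer. Phase 1 counts the records so that N% can be converted into a record count; phase 2 walks the sorted stream, assigns ranks, and switches logic at the cutoff.

```python
import math


def count_records(splits):
    """Phase 1: total record count (a counter or a small counting job in practice)."""
    return sum(len(split) for split in splits)


def logic_1(rank, value):
    """Placeholder for the processing applied to the top n% (hypothetical)."""
    return ("top", rank, value)


def logic_2(rank, value):
    """Placeholder for the processing applied to the remaining records (hypothetical)."""
    return ("rest", rank, value)


def single_reducer(sorted_values, cutoff):
    """Phase 2: records arrive already sorted; rank them and split at the cutoff."""
    for rank, value in enumerate(sorted_values, start=1):
        yield logic_1(rank, value) if rank <= cutoff else logic_2(rank, value)


splits = [[5, 17, 3, 9], [12, 1, 20, 8], [7, 2]]
percent = 30

total = count_records(splits)
cutoff = math.ceil(total * percent / 100)   # number of records in the top n%

# Stand-in for the shuffle: the framework would deliver values in this order.
stream = sorted((v for split in splits for v in split), reverse=True)
results = list(single_reducer(stream, cutoff))
```

In this sketch ranks are simply positions in the sorted stream, so tied values receive different ranks; a different tie-breaking rule would need extra state in the reducer.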

