praveenesh kumar

2013-02-02, 05:05

Eugene Kirpichov

2013-02-02, 06:23

praveenesh kumar

2013-02-02, 07:17

Russell Jurney

2013-02-02, 07:30

praveenesh kumar

2013-02-02, 07:37

Russell Jurney

2013-02-02, 08:10

praveenesh kumar

2013-02-02, 11:07

Niels Basjes

2013-02-02, 12:44

- Hadoop
- mail # user
- how to find top N values using map-reduce?

I am looking for a better solution for this.

One way to do this would be to find the top N values from each mapper and then find the top N of those in a single reducer. I am afraid this won't work effectively if my N is larger than the number of values in my input split (i.e., the mapper's input).

The other way is to just sort everything in a single reducer and then take the top N.

I am wondering if there is a better approach to do this?
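The two-pass pattern described above (a local top N per mapper, then a global top N in one reducer) can be sketched outside Hadoop in plain Python; `local_top_n` and `global_top_n` are hypothetical names used only for illustration:

```python
import heapq

def local_top_n(values, n):
    # What each mapper (or a combiner) would emit: only its n largest values.
    return heapq.nlargest(n, values)

def global_top_n(per_mapper_tops, n):
    # What the single reducer would do: merge the candidate lists
    # and keep the n largest overall.
    merged = [v for tops in per_mapper_tops for v in tops]
    return heapq.nlargest(n, merged)

# Simulate three input splits.
splits = [[5, 1, 9], [7, 3, 8], [2, 6, 4]]
candidates = [local_top_n(s, 2) for s in splits]
print(global_top_n(candidates, 2))  # -> [9, 8]
```

Note that the worry about N exceeding a split's record count is milder than it seems: in that case the mapper simply emits all of its records and the result is still correct; the real cost is the volume of candidates flowing into the single reducer.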

Regards

Praveenesh


Hi,

Can you tell more about:

* How big is N

* How big is the input dataset

* How many mappers you have

* Do input splits correlate with the sorting criterion for top N?

Depending on the answers, very different strategies will be optimal.

On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> I am looking for a better solution for this.

>

> 1 way to do this would be to find top N values from each mappers and

> then find out the top N out of them in 1 reducer. I am afraid that

> this won't work effectively if my N is larger than number of values in

> my inputsplit (or mapper input).

>

> Otherway is to just sort all of them in 1 reducer and then do the cat of

> top-N.

>

> Wondering if there is any better approach to do this ?

>

> Regards

> Praveenesh

>

--

Eugene Kirpichov

http://www.linkedin.com/in/eugenekirpichov

http://jkff.info/software/timeplotters - my performance visualization tools


Actually, what I am trying to find is the top n% of the whole data. This n could be very large if my data is large.

Assuming uniform rows of equal size and a total data size of 10 GB, with the above-mentioned approach, taking the top 10% of the whole data set means my mappers would need to hold 10% of 10 GB, i.e., roughly 1 GB worth of rows. I think that would not be possible given that my input splits are 64/128/512 MB (based on my block size), or am I making wrong assumptions? I can increase the input split size, but is there a better way to find the top n%?

My actual problem is to assign ranks to some values and then find the top 10 ranks. I hope this context gives a better idea of the problem.

Regards

Praveenesh

On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]> wrote:

> Hi,

>

> Can you tell more about:

> * How big is N

> * How big is the input dataset

> * How many mappers you have

> * Do input splits correlate with the sorting criterion for top N?

>

> Depending on the answers, very different strategies will be optimal.

>

>

>

> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]>wrote:

>

>> I am looking for a better solution for this.

>>

>> 1 way to do this would be to find top N values from each mappers and

>> then find out the top N out of them in 1 reducer. I am afraid that

>> this won't work effectively if my N is larger than number of values in

>> my inputsplit (or mapper input).

>>

>> Otherway is to just sort all of them in 1 reducer and then do the cat of

>> top-N.

>>

>> Wondering if there is any better approach to do this ?

>>

>> Regards

>> Praveenesh

>>

>

>

>

> --

> Eugene Kirpichov

> http://www.linkedin.com/in/eugenekirpichov

> http://jkff.info/software/timeplotters - my performance visualization tools


Pig. Datafu. 7 lines of code.

https://gist.github.com/4696443

https://github.com/linkedin/datafu

On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> Actually what I am trying to find to top n% of the whole data.

> This n could be very large if my data is large.

>

> Assuming I have uniform rows of equal size and if the total data size

> is 10 GB, using the above mentioned approach, if I have to take top

> 10% of the whole data set, I need 10% of 10GB which could be rows

> worth of 1 GB (roughly) in my mappers.

> I think that would not be possible given my input splits are of

> 64/128/512 MB (based on my block size) or am I making wrong

> assumptions. I can increase the inputsplit size, but is there a better

> way to find top n%.

>

>

> My whole actual problem is to give ranks to some values and then find

> out the top 10 ranks.

>

> I think this context can give more idea about the problem ?

>

> Regards

> Praveenesh

>

> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>

> wrote:

> > Hi,

> >

> > Can you tell more about:

> > * How big is N

> > * How big is the input dataset

> > * How many mappers you have

> > * Do input splits correlate with the sorting criterion for top N?

> >

> > Depending on the answers, very different strategies will be optimal.

> >

> >

> >

> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]

> >wrote:

> >

> >> I am looking for a better solution for this.

> >>

> >> 1 way to do this would be to find top N values from each mappers and

> >> then find out the top N out of them in 1 reducer. I am afraid that

> >> this won't work effectively if my N is larger than number of values in

> >> my inputsplit (or mapper input).

> >>

> >> Otherway is to just sort all of them in 1 reducer and then do the cat of

> >> top-N.

> >>

> >> Wondering if there is any better approach to do this ?

> >>

> >> Regards

> >> Praveenesh

> >>

> >

> >

> >

> > --

> > Eugene Kirpichov

> > http://www.linkedin.com/in/eugenekirpichov

> > http://jkff.info/software/timeplotters - my performance visualization

> tools

>

--

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com


Thanks for that, Russell. Unfortunately I can't use Pig; I need to write my own MR job. I was wondering how it's usually done in the best way possible.

Regards

Praveenesh

On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> Pig. Datafu. 7 lines of code.

>

> https://gist.github.com/4696443

> https://github.com/linkedin/datafu

>

>

> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]>wrote:

>

>> Actually what I am trying to find to top n% of the whole data.

>> This n could be very large if my data is large.

>>

>> Assuming I have uniform rows of equal size and if the total data size

>> is 10 GB, using the above mentioned approach, if I have to take top

>> 10% of the whole data set, I need 10% of 10GB which could be rows

>> worth of 1 GB (roughly) in my mappers.

>> I think that would not be possible given my input splits are of

>> 64/128/512 MB (based on my block size) or am I making wrong

>> assumptions. I can increase the inputsplit size, but is there a better

>> way to find top n%.

>>

>>

>> My whole actual problem is to give ranks to some values and then find

>> out the top 10 ranks.

>>

>> I think this context can give more idea about the problem ?

>>

>> Regards

>> Praveenesh

>>

>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>

>> wrote:

>> > Hi,

>> >

>> > Can you tell more about:

>> > * How big is N

>> > * How big is the input dataset

>> > * How many mappers you have

>> > * Do input splits correlate with the sorting criterion for top N?

>> >

>> > Depending on the answers, very different strategies will be optimal.

>> >

>> >

>> >

>> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]

>> >wrote:

>> >

>> >> I am looking for a better solution for this.

>> >>

>> >> 1 way to do this would be to find top N values from each mappers and

>> >> then find out the top N out of them in 1 reducer. I am afraid that

>> >> this won't work effectively if my N is larger than number of values in

>> >> my inputsplit (or mapper input).

>> >>

>> >> Otherway is to just sort all of them in 1 reducer and then do the cat of

>> >> top-N.

>> >>

>> >> Wondering if there is any better approach to do this ?

>> >>

>> >> Regards

>> >> Praveenesh

>> >>

>> >

>> >

>> >

>> > --

>> > Eugene Kirpichov

>> > http://www.linkedin.com/in/eugenekirpichov

>> > http://jkff.info/software/timeplotters - my performance visualization

>> tools

>>

>

>

>

> --

> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com


Maybe look at the Pig source to see how it does it?

Russell Jurney http://datasyndrome.com

On Feb 1, 2013, at 11:37 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> Thanks for that Russell. Unfortunately I can't use Pig. Need to write

> my own MR job. I was wondering how its usually done in the best way

> possible.

>

> Regards

> Praveenesh

>

> On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

>> Pig. Datafu. 7 lines of code.

>>

>> https://gist.github.com/4696443

>> https://github.com/linkedin/datafu

>>

>>

>> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]>wrote:

>>

>>> Actually what I am trying to find to top n% of the whole data.

>>> This n could be very large if my data is large.

>>>

>>> Assuming I have uniform rows of equal size and if the total data size

>>> is 10 GB, using the above mentioned approach, if I have to take top

>>> 10% of the whole data set, I need 10% of 10GB which could be rows

>>> worth of 1 GB (roughly) in my mappers.

>>> I think that would not be possible given my input splits are of

>>> 64/128/512 MB (based on my block size) or am I making wrong

>>> assumptions. I can increase the inputsplit size, but is there a better

>>> way to find top n%.

>>>

>>>

>>> My whole actual problem is to give ranks to some values and then find

>>> out the top 10 ranks.

>>>

>>> I think this context can give more idea about the problem ?

>>>

>>> Regards

>>> Praveenesh

>>>

>>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>

>>> wrote:

>>>> Hi,

>>>>

>>>> Can you tell more about:

>>>> * How big is N

>>>> * How big is the input dataset

>>>> * How many mappers you have

>>>> * Do input splits correlate with the sorting criterion for top N?

>>>>

>>>> Depending on the answers, very different strategies will be optimal.

>>>>

>>>>

>>>>

>>>> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]

>>>> wrote:

>>>>

>>>>> I am looking for a better solution for this.

>>>>>

>>>>> 1 way to do this would be to find top N values from each mappers and

>>>>> then find out the top N out of them in 1 reducer. I am afraid that

>>>>> this won't work effectively if my N is larger than number of values in

>>>>> my inputsplit (or mapper input).

>>>>>

>>>>> Otherway is to just sort all of them in 1 reducer and then do the cat of

>>>>> top-N.

>>>>>

>>>>> Wondering if there is any better approach to do this ?

>>>>>

>>>>> Regards

>>>>> Praveenesh

>>>>>

>>>>

>>>>

>>>>

>>>> --

>>>> Eugene Kirpichov

>>>> http://www.linkedin.com/in/eugenekirpichov

>>>> http://jkff.info/software/timeplotters - my performance visualization

>>> tools

>>>

>>

>>

>>

>> --

>> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com


My actual problem is to rank all values, then apply logic 1 to the top n% of values and logic 2 to the rest.

1st - Ranking? (I need major suggestions here.)

2nd - Find the top n% of them.

Then the rest is covered.
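The two steps above can be sketched in plain Python (`split_by_rank` is a hypothetical name; in a real job the ranking would be a total-order sort and the cutoff a counting pass):

```python
def split_by_rank(values, n_percent):
    # Step 1: rank all values (simplest form: sort descending).
    ranked = sorted(values, reverse=True)
    # Step 2: the top n% cutoff is a record count, not a value,
    # so convert the percentage into an index.
    cutoff = max(1, int(len(ranked) * n_percent / 100))
    return ranked[:cutoff], ranked[cutoff:]

top, rest = split_by_rank([10, 40, 30, 20, 50], 40)
print(top, rest)  # -> [50, 40] [30, 20, 10]
```

Logic 1 would then be applied to the first partition and logic 2 to the second.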

Regards

Praveenesh

On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <[EMAIL PROTECTED]> wrote:

> There's one thing I want to clarify: you can use multiple reducers to sort

> the data globally and then cat all the parts to get the top N records; the

> data across all parts is globally in order.

> Then you may find the problem is much easier.

>

> On 2013-2-2, 3:18 PM, "praveenesh kumar" <[EMAIL PROTECTED]> wrote:

>

>> Actually what I am trying to find to top n% of the whole data.

>> This n could be very large if my data is large.

>>

>> Assuming I have uniform rows of equal size and if the total data size

>> is 10 GB, using the above mentioned approach, if I have to take top

>> 10% of the whole data set, I need 10% of 10GB which could be rows

>> worth of 1 GB (roughly) in my mappers.

>> I think that would not be possible given my input splits are of

>> 64/128/512 MB (based on my block size) or am I making wrong

>> assumptions. I can increase the inputsplit size, but is there a better

>> way to find top n%.

>>

>>

>> My whole actual problem is to give ranks to some values and then find

>> out the top 10 ranks.

>>

>> I think this context can give more idea about the problem ?

>>

>> Regards

>> Praveenesh

>>

>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>

>> wrote:

>> > Hi,

>> >

>> > Can you tell more about:

>> > * How big is N

>> > * How big is the input dataset

>> > * How many mappers you have

>> > * Do input splits correlate with the sorting criterion for top N?

>> >

>> > Depending on the answers, very different strategies will be optimal.

>> >

>> >

>> >

>> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar

>> > <[EMAIL PROTECTED]>wrote:

>> >

>> >> I am looking for a better solution for this.

>> >>

>> >> 1 way to do this would be to find top N values from each mappers and

>> >> then find out the top N out of them in 1 reducer. I am afraid that

>> >> this won't work effectively if my N is larger than number of values in

>> >> my inputsplit (or mapper input).

>> >>

>> >> Otherway is to just sort all of them in 1 reducer and then do the cat

>> >> of

>> >> top-N.

>> >>

>> >> Wondering if there is any better approach to do this ?

>> >>

>> >> Regards

>> >> Praveenesh

>> >>

>> >

>> >

>> >

>> > --

>> > Eugene Kirpichov

>> > http://www.linkedin.com/in/eugenekirpichov

>> > http://jkff.info/software/timeplotters - my performance visualization

>> > tools


My suggestion is to use a secondary sort with a single reducer. That way you can easily extract the top N. If you want the top N%, you'll need an additional phase to determine how many records that N% really is.
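A minimal sketch of those two phases in plain Python (`count_records` and `take_top_k` are hypothetical names; in Hadoop the count would typically come from job counters and the top-k extraction from the secondary sort):

```python
import heapq
import math

def count_records(records):
    # Phase 1: a counting pass, so the N% can be converted
    # into a concrete record count k.
    return sum(1 for _ in records)

def take_top_k(records, k):
    # Phase 2: keep only the k largest records.
    return heapq.nlargest(k, records)

data = [3, 9, 1, 7, 5, 8, 2, 6, 4, 10]
k = math.ceil(count_records(data) * 10 / 100)  # top 10% -> k = 1
print(take_top_k(data, k))  # -> [10]
```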

--

Kind regards,

Niels Basjes

(Sent from mobile)

On 2 Feb 2013 12:08, "praveenesh kumar" <[EMAIL PROTECTED]> wrote:

> My actual problem is to rank all values and then run logic 1 to top n%

> values and logic 2 to rest values.

> 1st - Ranking ? (need major suggestions here)

> 2nd - Find top n% out of them.

> Then rest is covered.

>

> Regards

> Praveenesh

>

> On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <[EMAIL PROTECTED]> wrote:

> > there's one thing i want to clarify that you can use multi-reducers to

> sort

> > the data globally and then cat all the parts to get the top n records.

> The

> > data in all parts are globally in order.

> > Then you may find the problem is much easier.

> >

> > On 2013-2-2, 3:18 PM, "praveenesh kumar" <[EMAIL PROTECTED]> wrote:

> >

> >> Actually what I am trying to find to top n% of the whole data.

> >> This n could be very large if my data is large.

> >>

> >> Assuming I have uniform rows of equal size and if the total data size

> >> is 10 GB, using the above mentioned approach, if I have to take top

> >> 10% of the whole data set, I need 10% of 10GB which could be rows

> >> worth of 1 GB (roughly) in my mappers.

> >> I think that would not be possible given my input splits are of

> >> 64/128/512 MB (based on my block size) or am I making wrong

> >> assumptions. I can increase the inputsplit size, but is there a better

> >> way to find top n%.

> >>

> >>

> >> My whole actual problem is to give ranks to some values and then find

> >> out the top 10 ranks.

> >>

> >> I think this context can give more idea about the problem ?

> >>

> >> Regards

> >> Praveenesh

> >>

> >> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]

> >

> >> wrote:

> >> > Hi,

> >> >

> >> > Can you tell more about:

> >> > * How big is N

> >> > * How big is the input dataset

> >> > * How many mappers you have

> >> > * Do input splits correlate with the sorting criterion for top N?

> >> >

> >> > Depending on the answers, very different strategies will be optimal.

> >> >

> >> >

> >> >

> >> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar

> >> > <[EMAIL PROTECTED]>wrote:

> >> >

> >> >> I am looking for a better solution for this.

> >> >>

> >> >> 1 way to do this would be to find top N values from each mappers and

> >> >> then find out the top N out of them in 1 reducer. I am afraid that

> >> >> this won't work effectively if my N is larger than number of values

> in

> >> >> my inputsplit (or mapper input).

> >> >>

> >> >> Otherway is to just sort all of them in 1 reducer and then do the cat

> >> >> of

> >> >> top-N.

> >> >>

> >> >> Wondering if there is any better approach to do this ?

> >> >>

> >> >> Regards

> >> >> Praveenesh

> >> >>

> >> >

> >> >

> >> >

> >> > --

> >> > Eugene Kirpichov

> >> > http://www.linkedin.com/in/eugenekirpichov

> >> > http://jkff.info/software/timeplotters - my performance visualization

> >> > tools

>


Apache Lucene, Apache Solr and all other Apache Software Foundation project and their respective logos are trademarks of the Apache Software Foundation.

Elasticsearch, Kibana, Logstash, and Beats are trademarks of Elasticsearch BV, registered in the U.S. and in other countries. This site and Sematext Group is in no way affiliated with Elasticsearch BV.

Service operated by Sematext
