|
Ruslan Al-Fakikh
2011-09-08, 13:13
Norbert Burger
2011-09-08, 13:42
Dmitriy Ryaboy
2011-09-08, 16:19
Norbert Burger
2011-09-08, 17:03
Ruslan Al-Fakikh
2011-09-08, 19:46
Dmitriy Ryaboy
2011-09-09, 00:43
Dmitriy Ryaboy
2011-09-09, 00:45
Ruslan Al-Fakikh
2011-09-09, 11:20
Dmitriy Ryaboy
2011-09-09, 16:19
Ruslan Al-Fakikh
2011-09-12, 09:46
|
-
How to LIMIT a relation by percentage
Ruslan Al-Fakikh 2011-09-08, 13:13
Hey guys,

How can I LIMIT a relation by percentage? What I need is to sort a relation by a numeric column and then take the top 5% of tuples. As far as I understand, I cannot use an expression in the LIMIT operator. Do I have to write my own UDF? What type of UDF should I use then?

--
Best Regards,
Ruslan Al-Fakikh
-
Re: How to LIMIT a relation by percentage
Norbert Burger 2011-09-08, 13:42
Hi Ruslan -- no need to write your own UDF. There is a built-in function, TOP(), which will extract the top N tuples of a relation for you, where N is a configurable parameter. Take a look at:

http://pig.apache.org/docs/r0.9.0/func.html#topx

Norbert
-
Re: How to LIMIT a relation by percentage
Dmitriy Ryaboy 2011-09-08, 16:19
The example in the body of the ticket https://issues.apache.org/jira/browse/PIG-1926 is exactly the script you want. Note that this is a new feature: you need 0.10 (not released yet, currently in trunk) to get it to work.

You could also do it with TOP as Norbert suggests, but that has a bit of extra cost due to the sort TOP does.

D
-
Re: How to LIMIT a relation by percentage
Norbert Burger 2011-09-08, 17:03
Hi Dmitriy -- great info, thanks.

On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> You could also do it with TOP as Norbert suggests, but that has a bit of
> extra cost due to the sort TOP does.

Just for my understanding, doesn't the ORDER BY in the PIG-1926 example impose the same sort cost? It seems that you'd have to pay for a sort as long as the requirement is top N.

Norbert
-
Re: How to LIMIT a relation by percentage
Ruslan Al-Fakikh 2011-09-08, 19:46
Thank you guys! It worked for me.

This is to get the top 20%:

A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
B = GROUP A BY category;

topResults = FOREACH B {
    count = COUNT(A);
    result = TOP((int)(count * (20 / 100.0)), 2, A);
    GENERATE FLATTEN(result);
}

dump topResults;

--
Best Regards,
Ruslan Al-Fakikh
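Editor's note: for readers who want to check what Ruslan's script computes, here is a minimal Python sketch of the same logic. It is a model of the script's semantics, not Pig code; the function name `top_percent_per_category` and the sample rows are made up for illustration. It assumes the `(int)` cast truncates toward zero, which means small groups can contribute no tuples at all.

```python
from collections import defaultdict


def top_percent_per_category(rows, percent):
    """Model of the script above. rows are (category, visitor, impressions)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)              # B = GROUP A BY category

    result = []
    for bag in groups.values():
        count = len(bag)                        # count = COUNT(A)
        n = int(count * (percent / 100.0))      # (int) cast: truncates toward zero
        # TOP(n, 2, A): the n tuples with the largest value in column 2.
        # Sorting is used here only for brevity; the output order is illustrative.
        top = sorted(bag, key=lambda r: r[2], reverse=True)[:n]
        result.extend(top)                      # GENERATE FLATTEN(result)
    return result


# Hypothetical sample data: 5 rows in "news", 2 rows in "sport".
rows = [
    ("news", "v1", 50), ("news", "v2", 10), ("news", "v3", 30),
    ("news", "v4", 20), ("news", "v5", 40),
    ("sport", "v6", 7), ("sport", "v7", 9),
]

print(top_percent_per_category(rows, 20))  # [('news', 'v1', 50)]
```

With 20%, the "sport" group has 2 rows, so n = int(0.4) = 0 and it yields nothing. If every category should contribute at least one tuple, the count would need to be rounded up rather than truncated.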
-
Re: How to LIMIT a relation by percentage
Dmitriy Ryaboy 2011-09-09, 00:43
This isn't going to be very efficient. Pig will figure out that it can do COUNT in a distributed fashion (a count produced on each mapper, and summed at the reducer).

Normally, TOP can be distributed as well (top 3 of 100 items is top 3 of (top 3 of first 20, top 3 of next 20, etc.)). But since in this case Pig won't know how many of the top items to keep on a mapper until it's done the count, it won't kick into this optimization. If you are dealing with large datasets, calculating the count in a separate group-all, as in the example in the jira I linked to, is going to be much better.

D
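Editor's note: the "top 3 of 100 is top 3 of the per-chunk top 3s" property Dmitriy describes can be checked with a short Python sketch. This is an illustration of the idea only, not Pig's implementation; `top_k` and `distributed_top_k` are hypothetical names, and the five chunks of 20 stand in for mappers.

```python
import heapq


def top_k(items, k):
    """The k largest items, in descending order."""
    return heapq.nlargest(k, items)


def distributed_top_k(chunks, k):
    """Two-stage top-k: each chunk keeps its own k largest, then they are merged."""
    # "Mapper" side: every chunk can discard all but k items independently,
    # but only because k is known before any data is read.
    partials = [top_k(chunk, k) for chunk in chunks]
    # "Reducer" side: top k of the combined partial results.
    merged = [x for part in partials for x in part]
    return top_k(merged, k)


data = list(range(100))
chunks = [data[i:i + 20] for i in range(0, 100, 20)]  # 5 chunks of 20 items

print(distributed_top_k(chunks, 3))  # [99, 98, 97]
print(top_k(data, 3))                # [99, 98, 97]
```

When k is 20% of the total count, no single chunk knows k until every chunk has been counted, so the first stage cannot throw anything away. That is the reason for computing the count in a separate step first.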
-
Re: How to LIMIT a relation by percentage
Dmitriy Ryaboy 2011-09-09, 00:45
On Thu, Sep 8, 2011 at 10:03 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:
> Just for my understanding, doesn't the ORDER BY in the PIG-1926
> example impose the same sort cost? Seems that you'd have to pay for a
> sort as long as the requirement is top N.

TOP is actually more efficient than ORDER. In Ruslan's case, he doesn't need the order (or top) at all -- he just wants LIMIT, so that clause can be skipped.

D
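Editor's note: one general reason a top-N selection can be cheaper than a full ORDER is that it only ever needs to hold N items. The Python sketch below shows that technique with a bounded min-heap. It is an assumption-labeled illustration, not a description of how Pig implements TOP, and `top_n_bounded` is a hypothetical name.

```python
import heapq


def top_n_bounded(values, n):
    """Select the n largest values while holding at most n of them in memory."""
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest of the current candidates
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the weakest candidate, so it takes that slot.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)


values = [5, 1, 9, 3, 7]
print(top_n_bounded(values, 2))             # [9, 7]
print(sorted(values, reverse=True)[:2])     # [9, 7], via a full sort
```

Both lines give the same answer, but the first does O(len(values) * log n) work and keeps n items, while the second sorts everything before discarding most of it.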
-
Re: How to LIMIT a relation by percentage
Ruslan Al-Fakikh 2011-09-09, 11:20
Hello Dmitriy,

I guess you mean this example:

a = LOAD 'a.txt';
b = GROUP a all;
c = FOREACH b GENERATE COUNT(a) AS sum;
d = ORDER a BY $0;
e = LIMIT d c.sum/100;

But here they group all tuples.

In my example:

A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
B = GROUP A BY category;

topResults = FOREACH B {
    count = COUNT(A);
    result = TOP((int)(count * (20 / 100.0)), 2, A);
    GENERATE FLATTEN(result);
}

I group by category. What I actually need in the end is to take the top 20% of visitors (the visitors with the largest numbers of impressions) per category. So it probably can't be optimized, or am I missing something?

Thanks in advance!

--
Best Regards,
Ruslan Al-Fakikh
-
Re: How to LIMIT a relation by percentage
Dmitriy Ryaboy 2011-09-09, 16:19
Just replace the call to TOP with a call to LIMIT. In trunk, LIMIT takes expressions as arguments (it only took constants before).
-
Re: How to LIMIT a relation by percentage
Ruslan Al-Fakikh 2011-09-12, 09:46
But we are on version 0.8 now and planning to move to 0.9, so we are far away from 0.10. So I guess my way is the only one for now :(

Best Regards,
Ruslan Al-Fakikh
|