Re: How to LIMIT a relation by percentage
Hello Dmitriy,

I guess you mean this example:
a = LOAD 'a.txt';
b = GROUP a all;
c = FOREACH b GENERATE COUNT(a) AS sum;
d = ORDER a BY $0;
e = LIMIT d c.sum/100;

But that example groups all tuples together (GROUP ... ALL).

In my example:
A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
B = GROUP A BY category;

topResults = FOREACH B {
    count = COUNT(A);
    result = TOP((int)(count * (20 / 100.0)), 2, A);
    GENERATE FLATTEN(result);
}
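
For anyone following along, here is a rough sketch of what this script computes. It is written in Python purely for illustration, not Pig. The function name top_fraction_per_category and the sample rows are made up. It assumes the (int) cast truncates, so at 20% a category with fewer than five rows contributes nothing.

```python
import heapq
from collections import defaultdict

def top_fraction_per_category(rows, percent=20):
    """rows: iterable of (category, visitor, impressions) tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)              # like GROUP A BY category
    result = []
    for category in sorted(groups):
        members = groups[category]
        # Truncating conversion, mirroring (int)(count * (20 / 100.0))
        n = int(len(members) * (percent / 100.0))
        # Like TOP(n, 2, A): the n rows with the largest value in column 2.
        # Output order here is descending; Pig's TOP may not guarantee an order.
        result.extend(heapq.nlargest(n, members, key=lambda r: r[2]))
    return result

# Hypothetical sample data: 10 rows in "news", 4 rows in "sports"
rows = [("news", "v%d" % i, i * 10) for i in range(10)] + [
    ("sports", "a", 5), ("sports", "b", 7), ("sports", "c", 1), ("sports", "d", 9),
]
top = top_fraction_per_category(rows)
```

With this sample, "news" keeps 2 of its 10 rows, while "sports" keeps none because 4 * 0.2 truncates to 0.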

I group by category. What I actually need in the end is the top 20% of
visitors (those with the most impressions) per category.
So it probably can't be optimized, or am I missing something?
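
To make the optimization question concrete: as I read the reply quoted below, a top-N with a fixed N can be computed per chunk and then merged, while an N derived from the count is only known once everything has been counted. A small sketch of that idea, in Python rather than Pig (merged_top, the sample values and the chunk size of 20 are all hypothetical):

```python
import heapq

def merged_top(chunks, n):
    """Top n overall, computed from the top n of each chunk."""
    # Per-chunk step (what each mapper could do on its own)
    partial = [heapq.nlargest(n, chunk) for chunk in chunks]
    # Merge step (what the reducer would do with the partial results)
    return heapq.nlargest(n, [x for part in partial for x in part])

# 100 distinct sample values, split into 5 chunks of 20
values = [(i * 37) % 101 for i in range(100)]
chunks = [values[i:i + 20] for i in range(0, 100, 20)]

# Fixed n: the per-chunk results are enough to get the exact answer.
fixed = merged_top(chunks, 3)

# Percentage-based n: it depends on the total count, which no single
# chunk knows, so the per-chunk step cannot decide how many items to keep.
n_from_count = int(len(values) * (20 / 100.0))
```

This only illustrates the reasoning. It says nothing about how Pig actually plans the job.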

Thanks in advance!

On Fri, Sep 9, 2011 at 4:43 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> This isn't going to be very efficient -- Pig will figure out that it can do
> COUNT in a distributed fashion (count produced on each mapper, and summed at
> the reducer)
>
> Normally, TOP can be distributed as well (top 3 of 100 items is top 3 of
> (top 3 of first 20, top 3 of next 20, etc)).  But since in this case Pig
> won't know how many of the top items to keep on a mapper until it's done the
> count, it won't kick into this optimization.  If you are dealing with large
> datasets, calculating the count in a separate group-all, as in the example
> in the jira I linked to, is going to be much better.
>
> D
>
> On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>
>> Thank you guys! It worked for me:
>>
>> This is to get top 20%:
>>
>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>> B = GROUP A BY category;
>>
>> topResults = FOREACH B {
>>     count = COUNT(A);
>>     result = TOP((int)(count * (20 / 100.0)), 2, A);
>>     GENERATE FLATTEN(result);
>> }
>>
>> dump topResults;
>>
>> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[EMAIL PROTECTED]> wrote:
>> > Hi Dmitriy -- great info, thanks.
>> >
>> > On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>> >> You could also do it with TOP as Norbert suggests, but that has a bit of
>> >> extra cost due to the sort TOP does.
>> >
>> > Just for my understanding, doesn't the ORDER BY in the PIG-1926
>> > example impose the same sort cost?  Seems that you'd have to pay for a
>> > sort as long as the requirement is top N.
>> >
>> > Norbert
>> >
>> >> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:
>> >>
>> >>> Hi Ruslan -- no need to write your own UDF.  There is a built-in
>> >>> function TOP() which will extract for you the top N tuples of a
>> >>> relation, where N is a configurable parameter.  Take a look at:
>> >>>
>> >>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>> >>>
>> >>> Norbert
>> >>>
>> >>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
>> >>> <[EMAIL PROTECTED]> wrote:
>> >>> > Hey guys,
>> >>> >
>> >>> > How can I LIMIT a relation by percentage?
>> >>> > What I need is to sort a relation by a numeric column and then take
>> >>> > top 5% of tuples.
>> >>> > As far as I understand I cannot use an expression in the LIMIT
>> >>> > operator. Do I have to write my own UDF? What type of UDF should I
>> >>> > use then?
>> >>> >
>> >>> > --
>> >>> > Best Regards,
>> >>> > Ruslan Al-Fakikh
>> >>> >
>> >>>
>> >>
>> >
>>
>>
>>
>> --
>> Best Regards,
>> Ruslan Al-Fakikh
>>
>

--
Best Regards,
Ruslan Al-Fakikh