Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, cassandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
Pig >> mail # user >> How to LIMIT a relation by percentage


Re: How to LIMIT a relation by percentage
Hello Dmitriy,

I guess you mean this example:
a = LOAD 'a.txt';
b = GROUP a all;
c = FOREACH b GENERATE COUNT(a) AS sum;
d = ORDER a BY $0;
e = LIMIT d c.sum/100;

But that example groups all tuples into a single group.

In my example:
A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
B = GROUP A BY category;

topResults = FOREACH B {
    count = COUNT(A);
    -- keep the top 20% of the group's tuples, ranked by column 2 (impressions)
    result = TOP((int)(count * (20 / 100.0)), 2, A);
    GENERATE FLATTEN(result);
}

I group by category. What I actually need in the end is the top 20% of
visitors (those with the most impressions) per category.
So it probably can't be optimized, or am I missing something?
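
For illustration only, here is a small Python sketch (not Pig) of the idea from the reply quoted below: count first, so that the number of items to keep is known before the top-N selection starts. The function name top_fraction_per_category and the in-memory list of rows are assumptions made for this example; it says nothing about how Pig executes the script internally.

```python
import heapq
from collections import Counter, defaultdict


def top_fraction_per_category(rows, fraction=0.2):
    """Return the top `fraction` of rows per category, ranked by impressions.

    rows: iterable of (category, visitor, impressions) tuples.
    Hypothetical helper for illustration; not part of Pig.
    """
    rows = list(rows)

    # Pass 1: count rows per category so that k is known up front.
    counts = Counter(category for category, _, _ in rows)
    # Same truncation as the (int) cast in the Pig script.
    k_by_category = {c: int(n * fraction) for c, n in counts.items()}

    # Pass 2: keep a bounded min-heap of at most k items per category.
    heaps = defaultdict(list)
    for category, visitor, impressions in rows:
        k = k_by_category[category]
        if k == 0:
            continue  # too few rows in this category to keep any
        heap = heaps[category]
        item = (impressions, visitor)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # New item beats the smallest kept item, so swap it in.
            heapq.heapreplace(heap, item)

    # Largest impressions first within each category.
    return {c: sorted(h, reverse=True) for c, h in heaps.items()}
```

Because of the integer truncation, a category with fewer than 5 rows yields no results at 20%. The same applies to the (int) cast in the script above.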

Thanks in advance!

On Fri, Sep 9, 2011 at 4:43 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> This isn't going to be very efficient -- Pig will figure out that it can do
> COUNT in a distributed fashion (a count produced on each mapper, then summed
> at the reducer).
>
> Normally, TOP can be distributed as well (the top 3 of 100 items is the top 3
> of (top 3 of the first 20, top 3 of the next 20, etc.)).  But since in this
> case Pig won't know how many of the top items to keep on a mapper until it
> has finished the count, it won't kick into this optimization.  If you are
> dealing with large datasets, calculating the count in a separate group-all,
> as in the example in the JIRA I linked to, is going to be much better.
>
> D
>
> On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh <
> [EMAIL PROTECTED]> wrote:
>
>> Thank you guys! It worked for me:
>>
>> This is to get top 20%:
>>
>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>> B = GROUP A BY category;
>>
>> topResults = FOREACH B {
>>     count = COUNT(A);
>>     result = TOP((int)(count * (20 / 100.0)), 2, A);
>>     GENERATE FLATTEN(result);
>> }
>>
>> dump topResults;
>>
>> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[EMAIL PROTECTED]>
>> wrote:
>> > Hi Dmitriy -- great info, thanks.
>> >
>> > On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>> >> You could also do it with TOP as Norbert suggests, but that has a bit of
>> >> extra cost due to the sort TOP does.
>> >
>> > Just for my understanding, doesn't the ORDER BY in the PIG-1926
>> > example impose the same sort cost?  It seems that you'd have to pay for
>> > a sort as long as the requirement is top N.
>> >
>> > Norbert
>> >
>> >> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:
>> >>
>> >>> Hi Ruslan -- no need to write your own UDF.  There is a built-in
>> >>> function TOP() which will extract for you the top N tuples of a
>> >>> relation, where N is a configurable parameter.  Take a look at:
>> >>>
>> >>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>> >>>
>> >>> Norbert
>> >>>
>> >>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
>> >>> <[EMAIL PROTECTED]> wrote:
>> >>> > Hey guys,
>> >>> >
>> >>> > How can I LIMIT a relation by percentage?
>> >>> > What I need is to sort a relation by a numeric column and then take
>> >>> > the top 5% of tuples.
>> >>> > As far as I understand, I cannot use an expression in the LIMIT
>> >>> > operator. Do I have to write my own UDF? What type of UDF should I
>> >>> > use then?
>> >>> >
>> >>> > --
>> >>> > Best Regards,
>> >>> > Ruslan Al-Fakikh
>> >>> >
>> >>>
>> >>
>> >
>>
>>
>>
>> --
>> Best Regards,
>> Ruslan Al-Fakikh
>>
>

--
Best Regards,
Ruslan Al-Fakikh