Re: How to LIMIT a relation by percentage
Thank you guys! It worked for me:

This gets the top 20% of each category, ranked by impressions:

A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
B = GROUP A BY category;

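topResults = FOREACH B {
    -- number of tuples in this category
    count = COUNT(A);
    -- TOP(n, column, bag) keeps the n tuples with the largest values in
    -- column 2 (impressions, 0-based); the (int) cast truncates n
    result = TOP((int)(count * (20 / 100.0)), 2, A);
    GENERATE FLATTEN(result);
}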

dump topResults;
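
For the original question (top 5% of a whole relation, no grouping), the ORDER BY plus LIMIT approach from the PIG-1926 example mentioned below might look roughly like this. It is only a sketch: it assumes a Pig release where LIMIT accepts a scalar expression (0.10 or later, as far as I know), and the aliases G, C, D, E and the field name total are made up for illustration.

```pig
A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);

-- count the tuples in the whole relation (single group, single row)
G = GROUP A ALL;
C = FOREACH G GENERATE COUNT(A) AS total;

-- sort by the numeric column, largest first
D = ORDER A BY impressions DESC;

-- C.total is used as a scalar; integer division by 20 gives 5%
E = LIMIT D C.total / 20;

dump E;
```

The integer division rounds down, so a relation with fewer than 20 tuples would give an empty result here.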

On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[EMAIL PROTECTED]> wrote:
> Hi Dmitriy -- great info, thanks.
>
> On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>> You could also do it with TOP as Norbert suggests, but that has a bit of
>> extra cost due to the sort TOP does.
>
> Just for my understanding, doesn't the ORDER BY in the PIG-1926
> example impose the same sort cost?  Seems that you'd have to pay for a
> sort as long as the requirement is top N.
>
> Norbert
>
>> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Ruslan -- no need to write your own UDF.  There is a built-in
>>> function TOP() which will extract for you the top N tuples of a
>>> relation, where N is a configurable parameter.  Take a look at:
>>>
>>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>>>
>>> Norbert
>>>
>>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
>>> <[EMAIL PROTECTED]> wrote:
>>> > Hey guys,
>>> >
>>> > How can I LIMIT a relation by percentage?
>>> > What I need is to sort a relation by a numeric column and then take
>>> > top 5% of tuples.
>>> > As far as I understand I cannot use an expression in the LIMIT
>>> > operator. Do I have to write my own UDF? What type of UDF should I use
>>> > then?
>>> >
>>> > --
>>> > Best Regards,
>>> > Ruslan Al-Fakikh
>>> >
>>>
>>
>

--
Best Regards,
Ruslan Al-Fakikh