Pig, mail # user - How to LIMIT a relation by percentage


Re: How to LIMIT a relation by percentage
Norbert Burger 2011-09-08, 17:03
Hi Dmitriy -- great info, thanks.

On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> You could also do it with TOP as Norbert suggests, but that has a bit of
> extra cost due to the sort TOP does.

Just for my understanding, doesn't the ORDER BY in the PIG-1926
example impose the same sort cost?  It seems that you'd have to pay
for a sort as long as the requirement is top N.

Norbert
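
To make the cost question above concrete, here is a minimal Python sketch (not Pig, and not a description of how Pig implements ORDER BY or TOP) of two ways to take the top 5% of rows by a numeric column: a full sort followed by a cut-off, and a bounded selection that only tracks the k largest rows. The sample data, the function names, and the floor rounding of the 5% count are assumptions made for illustration.

```python
import heapq


def score(row):
    """Return the numeric column used for ranking (hypothetical layout: (id, score))."""
    return row[1]


def top_count(rows, percent):
    """Number of rows in the top `percent` percent, rounded down (an assumed rounding rule)."""
    return len(rows) * percent // 100


def top_percent_by_sort(rows, percent):
    """Sort every row in descending order, then keep the first k (sort-then-limit shape)."""
    k = top_count(rows, percent)
    return sorted(rows, key=score, reverse=True)[:k]


def top_percent_by_selection(rows, percent):
    """Keep only the k largest rows while scanning, without ordering the rest."""
    k = top_count(rows, percent)
    return heapq.nlargest(k, rows, key=score)


# Made-up sample: 100 rows with distinct scores in 0..100.
rows = [(i, (i * 37) % 101) for i in range(100)]

by_sort = top_percent_by_sort(rows, 5)
by_selection = top_percent_by_selection(rows, 5)
print(by_sort == by_selection)
print([score(r) for r in by_selection])
```

Both functions return the same rows. The difference is in the work done: the first orders all n rows, while the second only maintains the k best seen so far, which is cheaper when k is small relative to n. Whether that difference carries over to a particular Pig plan depends on how the operators are executed there.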

> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:
>
>> Hi Ruslan -- no need to write your own UDF.  There is a built-in
>> function TOP() which will extract for you the top N tuples of a
>> relation, where N is a configurable parameter.  Take a look at:
>>
>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>>
>> Norbert
>>
>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
>> <[EMAIL PROTECTED]> wrote:
>> > Hey guys,
>> >
>> > How can I LIMIT a relation by percentage?
>> > What I need is to sort a relation by a numeric column and then take
>> > top 5% of tuples.
>> > As far as I understand I cannot use an expression in the LIMIT
>> > operator. Do I have to write my own UDF? What type of UDF should I use
>> > then?
>> >
>> > --
>> > Best Regards,
>> > Ruslan Al-Fakikh
>> >
>>
>