Pig >> mail # user >> How to LIMIT a relation by percentage


Re: How to LIMIT a relation by percentage
But we are on version 0.8 now and planning to move to 0.9, so we are
far away from 0.10.

So I guess my way is the only one for now :(

On Fri, Sep 9, 2011 at 8:19 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Just replace the call to TOP with a call to limit. In trunk, limit takes expressions as arguments (it only took constants before)
>
> On Sep 9, 2011, at 4:20 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>
>> Hello Dmitriy,
>>
>> I guess you mean this example:
>> a = LOAD 'a.txt';
>> b = GROUP a all;
>> c = FOREACH b GENERATE COUNT(a) AS sum;
>> d = ORDER a BY $0;
>> e = LIMIT d c.sum/100;
>>
>> But here they group all tuples.
>>
>> In my example:
>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>> B = GROUP A BY category;
>>
>> topResults = FOREACH B {
>>   count = COUNT(A);
>>   result = TOP((int)(count * (20 / 100.0)), 2, A);
>>   GENERATE FLATTEN(result);
>> }
>>
>> I group by category. What I actually need in the end is to take the top
>> 20% of visitors (those with the largest numbers of impressions) per
>> category.
>> So it probably can't be optimized, or am I missing something?
>>
>> Thanks in advance!
>>
>> On Fri, Sep 9, 2011 at 4:43 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>>> This isn't going to be very efficient. Pig will figure out that it can do
>>> COUNT in a distributed fashion (a count produced on each mapper, and summed at
>>> the reducer).
>>>
>>> Normally, TOP can be distributed as well (top 3 of 100 items is top 3 of
>>> (top 3 of first 20, top 3 of next 20, etc.)).  But since in this case Pig
>>> won't know how many of the top items to keep on a mapper until it's done the
>>> count, it won't kick into this optimization.  If you are dealing with large
>>> datasets, calculating the count in a separate group-all, as in the example
>>> in the jira I linked to, is going to be much better.
>>>
>>> D
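
The merge property Dmitriy describes above (the top 3 of the whole set equals the top 3 of the per-chunk top 3s) can be sketched outside Pig. The Python below is an illustration only, not Pig's implementation: `top_n` and `distributed_top_n` are hypothetical helper names, and the "mapper" and "reducer" labels are just comments on ordinary list operations.

```python
import heapq


def top_n(items, n):
    """Return the n largest items, largest first."""
    return heapq.nlargest(n, items)


def distributed_top_n(chunks, n):
    """Top n computed in two stages, mimicking a map/reduce split."""
    # "Mapper" side: every chunk keeps only its own top n.
    partials = [top_n(chunk, n) for chunk in chunks]
    # "Reducer" side: top n of the concatenated partial results.
    merged = [x for partial in partials for x in partial]
    return top_n(merged, n)


# 100 distinct values (37 and 101 are coprime), split into five chunks of 20.
values = [(i * 37) % 101 for i in range(100)]
chunks = [values[i:i + 20] for i in range(0, 100, 20)]

# The two-stage result matches the single-pass result.
assert distributed_top_n(chunks, 3) == top_n(values, 3)
```

Each chunk has to know n before it can discard anything, which is why an n derived from a COUNT over the same group gets in the way of this optimization.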
>>>
>>> On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>>>
>>>> Thank you guys! It worked for me:
>>>>
>>>> This is to get top 20%:
>>>>
>>>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>>>> B = GROUP A BY category;
>>>>
>>>> topResults = FOREACH B {
>>>>   count = COUNT(A);
>>>>   result = TOP((int)(count * (20 / 100.0)), 2, A);
>>>>   GENERATE FLATTEN(result);
>>>> }
>>>>
>>>> dump topResults;
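
For readers less familiar with nested FOREACH, here is a rough Python equivalent of what the script above computes per category. It is a sketch under assumptions, not a statement of Pig semantics in every detail: `top_percent_per_category` is a hypothetical name, rows are plain tuples, the `(int)` cast is assumed to truncate the way Python's `int()` does, and tie-breaking between equal impression counts may differ from TOP.

```python
from collections import defaultdict
import heapq


def top_percent_per_category(rows, percent=20):
    """rows: iterable of (category, visitor, impressions) tuples."""
    # GROUP A BY category
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    result = []
    for bag in groups.values():
        # (int)(count * (20 / 100.0)): the fractional part is dropped.
        n = int(len(bag) * (percent / 100.0))
        # TOP(n, 2, A): the n rows with the largest value in column 2.
        result.extend(heapq.nlargest(n, bag, key=lambda r: r[2]))
    return result
```

Under that truncation assumption, a category with fewer than 5 rows gives n = 0 at 20% and contributes nothing to the output.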
>>>>
>>>> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[EMAIL PROTECTED]>
>>>> wrote:
>>>>> Hi Dmitriy -- great info, thanks.
>>>>>
>>>>> On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>>>>>> You could also do it with TOP as Norbert suggests, but that has a bit of
>>>>>> extra cost due to the sort TOP does.
>>>>>
>>>>> Just for my understanding, doesn't the ORDER BY in the PIG-1926
>>>>> example impose the same sort cost?  Seems that you'd have to pay for a
>>>>> sort as long as the requirement is top N.
>>>>>
>>>>> Norbert
>>>>>
>>>>>> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Hi Ruslan -- no need to write your own UDF.  There is a built-in
>>>>>>> function TOP() which will extract for you the top N tuples of a
>>>>>>> relation, where N is a configurable parameter.  Take a look at:
>>>>>>>
>>>>>>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>>>>>>>
>>>>>>> Norbert
>>>>>>>
>>>>>>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> How can I LIMIT a relation by percentage?
>>>>>>>> What I need is to sort a relation by a numeric column and then take
>>>>>>>> the top 5% of tuples.
>>>>>>>> As far as I understand I cannot use an expression in the LIMIT
>>>>>>>> operator. Do I have to write my own UDF? What type of UDF should I
>>>>>>>> use then?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Ruslan Al-Fakikh
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ruslan Al-Fakikh

Best Regards,
Ruslan Al-Fakikh