Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> UDF for generating top xx % of results?


Copy link to this message
-
RE: UDF for generating top xx % of results?
Hi Dave,

TOP = FILTER B BY count > V; would not work right now with pig. Alias can be
casted into scalars after https://issues.apache.org/jira/browse/PIG-1434 is
added. As far as I know, there is no simple way of using values generated by
mapreduce jobs directly into Pig jobs, you will need UDF to do that.
One dirty approach is to store this value in a location with a store and
then later use it to generate a new file with $0 and V and filter with $0 >
$1.

Thanks,
Aniket

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Dave
Viner
Sent: Tuesday, June 29, 2010 3:24 PM
To: [EMAIL PROTECTED]
Subject: Re: UDF for generating top xx % of results?

Actually, I've gotten the first half of the code to work now.   Here's how
it looks:
X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
C = GROUP B ALL;
U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
V = FOREACH U GENERATE min + (max-min)*0.95;

V is appropriately set... but I can't perform the final step of actually
filtering the values by that count.

The simple approach:
TOP = FILTER B BY count > V;

doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
{query: chararray,count: long}

Same with SPLIT.

How do I filter or split on the value of V?

Dave Viner

On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[EMAIL PROTECTED]> wrote:

> I don't quite understand this pig latin.  The piggybank
> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> which are compared.  Here's the sample I'm trying using the tutorial
> excitelog as a sample.
>
> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> (user:chararray, time:long, query:chararray);
> Z = FILTER X BY query is not null;
> A = GROUP Z BY query;
> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> U = FOREACH B GENERATE *,
>     MIN(count) as min:long,
>     MAX(count) as max:long;
>
> This doesn't seem to work at all.  It dies with this error:
> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
> Other Field Schema: min: long
>
> Changing the min:long and max:long to doubles (as suggested by the error
> message), causes this error:
> ERROR 1045: Could not infer the matching function for
> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
> explicit cast.
>
> What am I missing in using the sample code you've provided?  I can't seem
> to get it to work...
>
> Thanks for your help.
> Dave Viner
>
>
> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[EMAIL PROTECTED]> wrote:
>
>> That's what I tried to say in my last email.I don't believe you can
>> calculate exactly the percentiles in just one pass. Writing out the pig
>> for
>> two pass algorithm should be easy enough..
>>
>> P = group TABLE all;
>> U = foreach P generate MIN(x) as min, MAX(x) as max;
>> V = foreach U generate min + (max-min)*0.95;
>>
>> would give you the 95th percentile cutoff, and u just filter or split by
>> V.
>>
>>
>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[EMAIL PROTECTED]> wrote:
>>
>> > How would I calculate the percentile in one pass?  In order to
calculate
>> > the
>> > percentile for each item, I need to know the total count.  How do I get
>> the
>> > total count, and then calculate each item's percentile in one pass?
>> >
>> > I don't mind doing multiple passes - I am just not sure how to make the
>> > calculation.
>> >
>> > Thanks
>> > Dave Viner
>> >
>> >
>> > On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[EMAIL PROTECTED]> wrote:
>> >
>> > > I think it's impossible to do this within one M/R. You will want to
>> > > implement it in two M/R in Pig, because you have to calculate the
>> > > percentile
>> > > in pass 1, and then perform the filter in pass 2.
>> > >
>> > >
>> >