Pig, mail # user - Aggregations on nested foreach statements


Uri Laserson 2013-01-23, 01:17
Cheolsoo Park 2013-01-23, 02:01
Re: Aggregations on nested foreach statements
Uri Laserson 2013-01-24, 09:01
Thanks, Cheolsoo!

Uri
On Tue, Jan 22, 2013 at 6:01 PM, Cheolsoo Park <[EMAIL PROTECTED]> wrote:

> Hi Uri,
>
> Try this:
>
> data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> iid:chararray, num1:int, num2:int);
> grouped = group data by cid;
> results = foreach grouped generate FLATTEN(data), SUM(data.num2) as sum;
> appended = foreach results generate cid, iid, num1, num2,
>     (sum > 0 ? num1 : 0) as num3;
> dump appended;
>
> This will give you:
>
> (a,e,11,0,0)
> (b,f,2,2,2)
> (c,g,3,3,3)
> (c,h,44,44,44)
> (c,i,75,0,75)
> (d,j,89,0,0)
> (d,k,120,0,0)
> (d,l,3000,0,0)
>
> Thanks,
> Cheolsoo
>
>
> On Tue, Jan 22, 2013 at 5:17 PM, Uri Laserson <[EMAIL PROTECTED]>
> wrote:
>
> > I have data that looks like this:
> >
> > a e 11 0
> > b f 2 2
> > c g 3 3
> > c h 44 44
> > c i 75 0
> > d j 89 0
> > d k 120 0
> > d l 3000 0
> >
> > and I load it like so:
> >
> > data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> > iid:chararray, num1:int, num2:int);
> >
> > I want to group by the first column, cid.  For each group, if any of the
> > num2 values (last column) are positive, I want to output every tuple in
> > that group with an extra field equal to num1.  If all the num2 values for
> > that group are zero, then I want to output every tuple in that group with
> > an extra field equal to 0.
> >
> > I figured something like this would work:
> >
> > data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> > iid:chararray, num1:int, num2:int);
> > grouped = group data by cid;
> > results = foreach grouped {
> >     result1 = SUM(data.num2);
> >     extended = foreach data generate *, result1 > 0 ? num1 : 0;
> >     generate FLATTEN(extended);
> > };
> >
> > but it does not.  I get this error:
> >
> > 2013-01-22 17:15:07,647 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > ERROR 1200: <line 98, column 48>  mismatched input '>' expecting SEMI_COLON
> >
> > What is the proper way to do this?  From the MapReduce perspective, I group
> > by the key, and in the reducer, I compute a value for each group, and then
> > emit every single value for that group along with some extra data.
> >
> > Thanks!
> > Uri
> >
> >
> >
> > --
> > Uri Laserson, PhD
> > Data Scientist, Cloudera
> > Twitter/GitHub: @laserson
> > +1 617 910 0447
> > [EMAIL PROTECTED]
> >
>

--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
[EMAIL PROTECTED]
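
For readers who think of this in MapReduce terms, as the original question does, the per-group logic can be sketched in plain Python. This is only an illustration of the semantics, not how Pig executes the script; the names `ROWS` and `append_num3` are hypothetical. Like the Pig script in the reply, it tests the group's sum of num2 rather than "any num2 is positive", which is equivalent only under the assumption that num2 is never negative (true for the sample data).

```python
from collections import defaultdict

# Sample rows from the thread: (cid, iid, num1, num2)
ROWS = [
    ("a", "e", 11, 0),
    ("b", "f", 2, 2),
    ("c", "g", 3, 3),
    ("c", "h", 44, 44),
    ("c", "i", 75, 0),
    ("d", "j", 89, 0),
    ("d", "k", 120, 0),
    ("d", "l", 3000, 0),
]


def append_num3(rows):
    """Group rows by cid, then emit every row with an extra field:
    num1 if the group's num2 total is positive, otherwise 0."""
    # "Shuffle" step: collect rows under their grouping key.
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    out = []
    # "Reduce" step: one aggregate per group, then re-emit each member.
    for cid in sorted(groups):
        members = groups[cid]
        total = sum(num2 for _, _, _, num2 in members)
        for key, iid, num1, num2 in members:
            out.append((key, iid, num1, num2, num1 if total > 0 else 0))
    return out


if __name__ == "__main__":
    for row in append_num3(ROWS):
        print(row)
```

On the sample data this yields the same eight tuples as the Pig output quoted above, for example ('c', 'i', 75, 0, 75) for a row whose own num2 is 0 but whose group total is positive.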