Pig >> mail # user >> Aggregations on nested foreach statements


Uri Laserson 2013-01-23, 01:17
Cheolsoo Park 2013-01-23, 02:01
Re: Aggregations on nested foreach statements
Thanks Cheolsoo!

Uri
On Tue, Jan 22, 2013 at 6:01 PM, Cheolsoo Park <[EMAIL PROTECTED]> wrote:

> Hi Uri,
>
> Try this:
>
> data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> iid:chararray, num1:int, num2:int);
> grouped = group data by cid;
> results = foreach grouped generate FLATTEN(data), SUM(data.num2) as sum;
> appended = foreach results generate cid, iid, num1, num2, (sum > 0 ? num1 :
> 0) as num3;
> dump appended;
>
> This will give you:
>
> (a,e,11,0,0)
> (b,f,2,2,2)
> (c,g,3,3,3)
> (c,h,44,44,44)
> (c,i,75,0,75)
> (d,j,89,0,0)
> (d,k,120,0,0)
> (d,l,3000,0,0)
>
> Thanks,
> Cheolsoo
>
>
> On Tue, Jan 22, 2013 at 5:17 PM, Uri Laserson <[EMAIL PROTECTED]> wrote:
>
> > I have data that looks like this:
> >
> > a e 11 0
> > b f 2 2
> > c g 3 3
> > c h 44 44
> > c i 75 0
> > d j 89 0
> > d k 120 0
> > d l 3000 0
> >
> > and I load it like so:
> >
> > data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> > iid:chararray, num1:int, num2:int);
> >
> > I want to group by the first column, cid.  For each group, if any of the
> > num2 values (last column) are positive, I want to output every tuple in
> > that group with an extra field equal to num1.  If all the num2 values for
> > that group are zero, then I want to output every tuple in that group with
> > an extra field equal to 0.
> >
> > I figured something like this would work:
> >
> > data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> > iid:chararray, num1:int, num2:int);
> > grouped = group data by cid;
> > results = foreach grouped {
> >     result1 = SUM(data.num2);
> >     extended = foreach data generate *, result1 > 0 ? num1 : 0;
> >     generate FLATTEN(extended);
> > };
> >
> > but it does not.  I get this error:
> >
> > 2013-01-22 17:15:07,647 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > ERROR 1200: <line 98, column 48>  mismatched input '>' expecting SEMI_COLON
> >
> > What is the proper way to do this?  From the MapReduce perspective, I group
> > by the key, and in the reducer, I compute a value for each group, and then
> > emit every single value for that group along with some extra data.
> >
> > Thanks!
> > Uri
> >
> >
> >
> > --
> > Uri Laserson, PhD
> > Data Scientist, Cloudera
> > Twitter/GitHub: @laserson
> > +1 617 910 0447
> > [EMAIL PROTECTED]
> >
>

--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
[EMAIL PROTECTED]
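
For reference, the group-then-emit logic described above from the MapReduce perspective can be sketched in plain Python. This is not Pig and not anything posted in the thread, only a minimal illustration of the intended semantics. The function name append_num3 and the ROWS constant are hypothetical. The sketch mirrors the SUM(data.num2) > 0 test from the reply, which matches "any num2 value is positive" only if num2 is never negative, as in the sample data.

```python
from collections import defaultdict

# Sample rows from the thread, as (cid, iid, num1, num2).
ROWS = [
    ("a", "e", 11, 0),
    ("b", "f", 2, 2),
    ("c", "g", 3, 3),
    ("c", "h", 44, 44),
    ("c", "i", 75, 0),
    ("d", "j", 89, 0),
    ("d", "k", 120, 0),
    ("d", "l", 3000, 0),
]


def append_num3(rows):
    """Group rows by cid and emit every row with an extra field num3.

    num3 equals num1 when the group's num2 values sum to more than zero,
    and 0 otherwise. Assumes num2 is never negative (hypothetical helper).
    """
    # "Shuffle" step: collect the rows for each key.
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    out = []
    # "Reduce" step: one aggregate per group, then re-emit every row.
    for cid in sorted(groups):
        total = sum(r[3] for r in groups[cid])
        for c, iid, num1, num2 in groups[cid]:
            out.append((c, iid, num1, num2, num1 if total > 0 else 0))
    return out


if __name__ == "__main__":
    for row in append_num3(ROWS):
        print(row)
```

If num2 could be negative, replacing the sum with any(r[3] > 0 for r in groups[cid]) would follow the original wording of the question more closely.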