Pig >> mail # user >> Aggregations on nested foreach statements


Uri Laserson 2013-01-23, 01:17
Re: Aggregations on nested foreach statements
Hi Uri,

Try this:

data = load 'test.txt' using PigStorage(' ') as (cid:chararray, iid:chararray, num1:int, num2:int);
grouped = group data by cid;
results = foreach grouped generate FLATTEN(data), SUM(data.num2) as sum;
appended = foreach results generate cid, iid, num1, num2, (sum > 0 ? num1 : 0) as num3;
dump appended;

This will give you:

(a,e,11,0,0)
(b,f,2,2,2)
(c,g,3,3,3)
(c,h,44,44,44)
(c,i,75,0,75)
(d,j,89,0,0)
(d,k,120,0,0)
(d,l,3000,0,0)
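
One caveat on the SUM test: it matches "any num2 is positive" only if num2 is never negative, which holds for the sample data. A minimal sketch of a variant, assuming negative num2 values are possible, swaps in the builtin MAX (the alias max_num2 is just an illustrative name):

```pig
data = load 'test.txt' using PigStorage(' ') as (cid:chararray, iid:chararray, num1:int, num2:int);
grouped = group data by cid;
-- MAX(data.num2) > 0 holds exactly when at least one num2 in the group is positive,
-- so negative values cannot cancel out a positive one as they could with SUM.
results = foreach grouped generate FLATTEN(data), MAX(data.num2) as max_num2;
-- max_num2 is a hypothetical alias; the bincond is wrapped in parentheses.
appended = foreach results generate cid, iid, num1, num2, (max_num2 > 0 ? num1 : 0) as num3;
dump appended;
```

On the sample data above this should produce the same eight tuples, since no num2 value there is negative.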

Thanks,
Cheolsoo
On Tue, Jan 22, 2013 at 5:17 PM, Uri Laserson <[EMAIL PROTECTED]> wrote:

> I have data that looks like this:
>
> a e 11 0
> b f 2 2
> c g 3 3
> c h 44 44
> c i 75 0
> d j 89 0
> d k 120 0
> d l 3000 0
>
> and I load it like so:
>
> data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> iid:chararray, num1:int, num2:int);
>
> I want to group by the first column, cid.  For each group, if any of the
> num2 values (last column) are positive, I want to output every tuple in
> that group with an extra field equal to num1.  If all the num2 values for
> that group are zero, then I want to output every tuple in that group with
> an extra field equal to 0.
>
> I figured something like this would work:
>
> data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> iid:chararray, num1:int, num2:int);
> grouped = group data by cid;
> results = foreach grouped {
>     result1 = SUM(data.num2);
>     extended = foreach data generate *, result1 > 0 ? num1 : 0;
>     generate FLATTEN(extended);
> };
>
> but it does not.  I get this error:
>
> 2013-01-22 17:15:07,647 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 98, column 48>  mismatched input '>' expecting SEMI_COLON
>
> What is the proper way to do this?  From the MapReduce perspective, I group
> by the key, and in the reducer, I compute a value for each group, and then
> emit every single value for that group along with some extra data.
>
> Thanks!
> Uri
>
>
>
> --
> Uri Laserson, PhD
> Data Scientist, Cloudera
> Twitter/GitHub: @laserson
> +1 617 910 0447
> [EMAIL PROTECTED]
>
Uri Laserson 2013-01-24, 09:01