Aggregations on nested foreach statements
I have data that looks like this:

a e 11 0
b f 2 2
c g 3 3
c h 44 44
c i 75 0
d j 89 0
d k 120 0
d l 3000 0

and I load it like so:

data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
iid:chararray, num1:int, num2:int);

I want to group by the first column, cid.  For each group, if any of the
num2 values (last column) is positive, I want to output every tuple in
that group with an extra field equal to that tuple's num1.  If all the num2
values for that group are zero, then I want to output every tuple in that
group with an extra field equal to 0.

I figured something like this would work:

data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
iid:chararray, num1:int, num2:int);
grouped = group data by cid;
results = foreach grouped {
    result1 = SUM(data.num2);
    extended = foreach data generate *, result1 > 0 ? num1 : 0;
    generate FLATTEN(extended);
};

but it does not.  I get this error:

2013-01-22 17:15:07,647 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: <line 98, column 48>  mismatched input '>' expecting SEMI_COLON

What is the proper way to do this?  In MapReduce terms, I would group by
the key, compute one value per group in the reducer, and then emit every
record in that group along with the extra field.
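
To make the intent concrete, here is a minimal Python sketch of the reducer logic I have in mind. It is not Pig, and the names `reduce_group` and `run` are made up for illustration. It assumes the test is "any num2 is positive", which matches SUM(num2) > 0 only when no num2 is negative.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the data above: (cid, iid, num1, num2)
ROWS = [
    ("a", "e", 11, 0),
    ("b", "f", 2, 2),
    ("c", "g", 3, 3),
    ("c", "h", 44, 44),
    ("c", "i", 75, 0),
    ("d", "j", 89, 0),
    ("d", "k", 120, 0),
    ("d", "l", 3000, 0),
]


def reduce_group(rows):
    """Hypothetical reducer: all rows passed in share one cid."""
    rows = list(rows)  # materialize so the group can be scanned twice
    any_positive = any(num2 > 0 for _, _, _, num2 in rows)
    for cid, iid, num1, num2 in rows:
        # Extra field is the row's own num1 if the group has a positive num2, else 0.
        yield (cid, iid, num1, num2, num1 if any_positive else 0)


def run(rows):
    """Stand-in for shuffle/sort plus one reducer call per key."""
    ordered = sorted(rows, key=itemgetter(0))  # stable sort keeps in-group order
    out = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        out.extend(reduce_group(group))
    return out


for row in run(ROWS):
    print(*row)
```

For the sample data, groups a and d get 0 in the extra field, while b and c get each row's own num1 (so "c i 75 0" becomes "c i 75 0 75").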

Thanks!
Uri

--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447