Pig >> mail # user >> Aggregations on nested foreach statements


Uri Laserson 2013-01-23, 01:17
Re: Aggregations on nested foreach statements
Hi Uri,

Try this:

data = load 'test.txt' using PigStorage(' ') as (cid:chararray, iid:chararray, num1:int, num2:int);
grouped = group data by cid;
results = foreach grouped generate FLATTEN(data), SUM(data.num2) as sum;
appended = foreach results generate cid, iid, num1, num2, (sum > 0 ? num1 : 0) as num3;
dump appended;

This will give you:

(a,e,11,0,0)
(b,f,2,2,2)
(c,g,3,3,3)
(c,h,44,44,44)
(c,i,75,0,75)
(d,j,89,0,0)
(d,k,120,0,0)
(d,l,3000,0,0)

Thanks,
Cheolsoo
On Tue, Jan 22, 2013 at 5:17 PM, Uri Laserson <[EMAIL PROTECTED]> wrote:

> I have data that looks like this:
>
> a e 11 0
> b f 2 2
> c g 3 3
> c h 44 44
> c i 75 0
> d j 89 0
> d k 120 0
> d l 3000 0
>
> and I load it like so:
>
> data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> iid:chararray, num1:int, num2:int);
>
> I want to group by the first column, cid.  For each group, if any of the
> num2 values (last column) are positive, I want to output every tuple in
> that group with an extra field equal to num1.  If all the num2 values for
> that group are zero, then I want to output every tuple in that group with
> an extra field equal to 0.
>
> I figured something like this would work:
>
> data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
> iid:chararray, num1:int, num2:int);
> grouped = group data by cid;
> results = foreach grouped {
>     result1 = SUM(data.num2);
>     extended = foreach data generate *, result1 > 0 ? num1 : 0;
>     generate FLATTEN(extended);
> };
>
> but it does not.  I get this error:
>
> 2013-01-22 17:15:07,647 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 98, column 48>  mismatched input '>' expecting SEMI_COLON
>
> What is the proper way to do this?  From the MapReduce perspective, I group
> by the key, and in the reducer, I compute a value for each group, and then
> emit every single value for that group along with some extra data.
>
> Thanks!
> Uri
>
>
>
> --
> Uri Laserson, PhD
> Data Scientist, Cloudera
> Twitter/GitHub: @laserson
> +1 617 910 0447
> [EMAIL PROTECTED]
>
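[Editor's note: to cross-check the expected output in the reply above, the same group-then-flag logic can be sketched in plain Python. This is a hypothetical stand-in for the Pig job, not part of the original thread.]

```python
# Hypothetical stand-in for the Pig script: group rows by cid, then append
# num3 = num1 if the group's num2 sum is positive, else 0.
from collections import defaultdict

rows = [
    ("a", "e", 11, 0), ("b", "f", 2, 2), ("c", "g", 3, 3),
    ("c", "h", 44, 44), ("c", "i", 75, 0), ("d", "j", 89, 0),
    ("d", "k", 120, 0), ("d", "l", 3000, 0),
]

# Reducer-side aggregate: sum num2 per group, mirroring SUM(data.num2).
sums = defaultdict(int)
for cid, _, _, num2 in rows:
    sums[cid] += num2

# Append the extra field, mirroring the (sum > 0 ? num1 : 0) bincond.
appended = [
    (cid, iid, num1, num2, num1 if sums[cid] > 0 else 0)
    for cid, iid, num1, num2 in rows
]
for t in appended:
    print(t)
```

Running this reproduces the eight tuples shown in the reply, e.g. `(a,e,11,0,0)` for group `a` (whose num2 values sum to zero) and `(c,i,75,0,75)` for group `c` (whose sum is positive).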
Uri Laserson 2013-01-24, 09:01